Abstract



We provide rigorous guarantees on learning with the weighted trace-norm under arbitrary 
sampling distributions. We show that the standard weighted trace-norm might fail when the 
sampling distribution is not a product distribution (i.e. when row and column indexes are 
not selected independently), present a corrected variant for which we establish strong learning 
guarantees, and demonstrate that it works better in practice. We provide guarantees when 
weighting by either the true or empirical sampling distribution, and suggest that even if the 
true distribution is known (or is uniform), weighting by the empirical distribution may be 
beneficial. 



1 Introduction 

One of the most common approaches to collaborative filtering and matrix completion is trace-norm 
regularization (XJ [21 El S] . In this approach we attempt to complete an unknown matrix, based on a small 
subset of revealed entries, by finding a matrix with small trace-norm, which matches those entries as best 
as possible. 

This approach has repeatedly shown good performance in practice, and is theoretically well under- 
stood for the case where revealed entries are sampled uniformly [SJ [HI El El [HI HO] • Under such uniform 
sampling, $O(n \log(n))$ entries are sufficient for good completion of an $n \times n$ matrix, i.e. a nearly
constant number of entries per row. However, for arbitrary sampling distributions, the worst-case sample
complexity lies between a lower bound of $\Omega(n^{4/3})$ [11] and an upper bound of $O(n^{3/2})$ [13], i.e. requiring
between $n^{1/3}$ and $n^{1/2}$ observations per row, and indicating it is not appropriate for matrix completion
in this setting.

Motivated by these issues, Salakhutdinov and Srebro [11] proposed to use a weighted variant of the 
trace-norm, which takes the distribution of the entries into account, and showed experimentally that this 
variant indeed leads to superior performance. However, although this recent paper established that the 
weighted trace-norm corrects a specific situation where the standard trace-norm fails, no general learning 
guarantees are provided, and it is not clear if indeed the weighted trace-norm always leads to the desired 
behavior. The only theoretical analysis of the weighted trace-norm that we are aware of is a recent report 
by Negahban and Wainwright [9] that provides reconstruction guarantees for a low-rank matrix with 
i.i.d. noise, but only when the sampling distribution is a product distribution, i.e. the row index and
column index of observed entries are selected independently. A product distribution assumption does
not seem realistic in many cases — e.g. for the Netflix data, it would indicate that all users have the same 
(conditional) distribution over which movies they rate. 



In this paper we rigorously study learning with a weighted trace-norm under an arbitrary sampling 
distribution, and show that this situation is indeed more complicated, requiring a correction to the 
weighting. We show that this correction is necessary, and present empirical results on the Netflix and
MovieLens datasets indicating that it is also helpful in practice. We also rigorously consider weighting
according to either the true sampling distribution (as in [5]) or the empirical frequencies, as is actually
done in practice, and present evidence that weighting by the empirical frequencies might be advantageous.
Our setting is also more general than that of [9]: we consider an arbitrary loss and do not rely on
i.i.d. noise, instead presenting results in an agnostic learning framework.

Setup and Notation. We consider an arbitrary unknown $n \times m$ target matrix $Y$, where a subset of
entries $\{Y_{i_t j_t}\}_{t=1}^{s}$ indexed by $S = \{(i_1,j_1), \ldots, (i_s,j_s)\}$ is revealed to us. Without loss of generality,
we assume $n \ge m$. Throughout most of the paper, we assume $S$ is drawn i.i.d. according to some
sampling distribution $p(i,j)$ (with replacement). Based on this subset of entries, we would like to fill
in the missing entries and obtain a prediction matrix $\hat{X}_S \in \mathbb{R}^{n \times m}$, with low expected loss
$L_p(\hat{X}_S) = \mathbb{E}_{ij \sim p}\left[\ell((\hat{X}_S)_{ij}, Y_{ij})\right]$, where $\ell(x,y)$ is some loss function. Note that we measure the loss
with respect to the same distribution $p(i,j)$ from which the training set is drawn (this is also the case in [11, 9, 12]).
Given some distribution $p(i,j)$ on $[n] \times [m]$, the weighted trace-norm of a matrix $X \in \mathbb{R}^{n \times m}$ is given
by [11]

\[ \|X\|_{\mathrm{tr}(p^r,p^c)} = \left\| \mathrm{diag}(p^r)^{1/2} \cdot X \cdot \mathrm{diag}(p^c)^{1/2} \right\|_{\mathrm{tr}}, \]

where $p^r \in \mathbb{R}^n$ and $p^c \in \mathbb{R}^m$ denote vectors of the row- and column-marginals respectively. Note that
the weighted trace-norm only depends on these marginals (but not their joint distribution) and that if $p^r$
and $p^c$ are uniform, then $\|X\|_{\mathrm{tr}(p^r,p^c)} = \frac{1}{\sqrt{nm}} \|X\|_{\mathrm{tr}}$. The weighted trace-norm does not generally scale
with $n$ and $m$, and in particular, if $X$ has rank $r$ and entries bounded in $[-1,1]$, then $\|X\|_{\mathrm{tr}(p^r,p^c)} \le \sqrt{r}$
regardless of which $p$ is used. This motivates us to define the class

\[ \mathcal{W}_r[p] = \left\{ X \in \mathbb{R}^{n \times m} : \|X\|_{\mathrm{tr}(p^r,p^c)} \le \sqrt{r} \right\}, \]

although we emphasize that our results do not directly depend on the rank, and $\mathcal{W}_r[p]$ certainly includes
full-rank matrices. We analyze here estimators of the form $\hat{X}_S = \arg\min\{\hat{L}_S(X) : X \in \mathcal{W}_r[p]\}$, where
$\hat{L}_S(X) = \frac{1}{s} \sum_{t=1}^{s} \ell(X_{i_t j_t}, Y_{i_t j_t})$ is the empirical error on the observed entries.
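
As a concrete illustration of the definition above, the following is a minimal sketch that computes the weighted trace-norm with NumPy and checks membership in the class. The function names `weighted_trace_norm` and `in_class` are hypothetical, and the use of a dense SVD is an assumption made for brevity; it is only practical for small matrices.

```python
import numpy as np

def weighted_trace_norm(X, p_row, p_col):
    """Trace norm of diag(p_row)^(1/2) @ X @ diag(p_col)^(1/2)."""
    X = np.asarray(X, dtype=float)
    p_row = np.asarray(p_row, dtype=float)
    p_col = np.asarray(p_col, dtype=float)
    # Scale row i by sqrt(p_row[i]) and column j by sqrt(p_col[j]).
    scaled = np.sqrt(p_row)[:, None] * X * np.sqrt(p_col)[None, :]
    # The trace norm is the sum of the singular values.
    return float(np.linalg.svd(scaled, compute_uv=False).sum())

def in_class(X, p_row, p_col, r, tol=1e-9):
    """Check whether X lies in the class with weighted trace-norm at most sqrt(r)."""
    return weighted_trace_norm(X, p_row, p_col) <= np.sqrt(r) + tol

# Example: a rank-1 matrix with entries in [-1, 1] and uniform marginals.
n, m = 4, 3
X = np.ones((n, m))
uniform_norm = weighted_trace_norm(X, np.full(n, 1.0 / n), np.full(m, 1.0 / m))
plain_norm = float(np.linalg.svd(X, compute_uv=False).sum())
```

With uniform marginals the weighted norm equals the plain trace norm divided by sqrt(nm), and the rank-1 matrix of ones has weighted trace-norm 1, so it belongs to the class with r = 1.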

Although we focus mostly on the standard inductive setting, where the samples are drawn i.i.d. and 
the guarantee is on generalization for future samples drawn by the same distribution, our results can also 
be stated in a transductive model, where a training set and a test set are created by splitting a fixed 
subset of entries uniformly at random (as in [12]). The transductive setting is discussed in Section 4.2,
and variants of our Theorems in this setting are found there and in Appendix B.

2 Learning with the Standard Weighting 

In this Section, we consider learning using the weighted trace-norm as suggested by Salakhutdinov and
Srebro [11], i.e. when the weighting is according to the sampling distribution $p(i,j)$. Following the
approach of [S] and [TU], we base our results on bounding the Rademacher complexity of W r [p], as a 
class of functions mapping index pairs to entry values. However, we modify the analysis for the weighted 
trace-norm with non-uniform sampling. 

For a class of matrices $\mathcal{X}$ and a sample $S = \{(i_1,j_1), \ldots, (i_s,j_s)\}$ of indexes in $[n] \times [m]$, the empirical
Rademacher complexity of the class (with respect to $S$) is given by

\[ \hat{\mathcal{R}}_S(\mathcal{X}) = \mathbb{E}_{\sigma \sim \{\pm 1\}^s} \left[ \sup_{X \in \mathcal{X}} \frac{1}{s} \sum_{t=1}^{s} \sigma_t X_{i_t j_t} \right], \]

where $\sigma$ is a vector of signs drawn uniformly at random. Intuitively, $\hat{\mathcal{R}}_S(\mathcal{X})$ measures the extent to which
the class $\mathcal{X}$ can "overfit" data, by finding a matrix $X$ which correlates as strongly as possible to a sample
from a matrix of random noise. For a loss $\ell(x,y)$ that is Lipschitz in $x$, the Rademacher complexity can
be used to uniformly bound the deviations $|L_p(X) - \hat{L}_S(X)|$ for all $X \in \mathcal{X}$, yielding a learning guarantee
on the empirical risk minimizer [13].
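
To make the definition concrete, the sketch below estimates the empirical Rademacher complexity of the weighted trace-norm ball by Monte Carlo. It relies on the duality between the trace norm and the spectral norm: the supremum over the ball of radius sqrt(r) equals sqrt(r) times the spectral norm of the sign-weighted, rescaled indicator matrix. The function name, the number of sign draws, and the seed are assumptions for illustration, and all marginals are assumed strictly positive.

```python
import numpy as np

def rademacher_weighted_class(sample, p_row, p_col, r, n_draws=200, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of
    {X : ||diag(p_row)^(1/2) X diag(p_col)^(1/2)||_tr <= sqrt(r)} on `sample`."""
    rng = np.random.default_rng(seed)
    n, m = len(p_row), len(p_col)
    rows = np.array([i for i, _ in sample])
    cols = np.array([j for _, j in sample])
    # Each sampled entry (i, j) contributes sigma_t / sqrt(p_row[i] * p_col[j]).
    weights = 1.0 / np.sqrt(np.asarray(p_row)[rows] * np.asarray(p_col)[cols])
    s = len(sample)
    total = 0.0
    for _ in range(n_draws):
        signs = rng.choice([-1.0, 1.0], size=s)
        Q = np.zeros((n, m))
        # np.add.at accumulates correctly when an entry is sampled more than once.
        np.add.at(Q, (rows, cols), signs * weights)
        total += np.linalg.norm(Q, 2)  # spectral norm (largest singular value)
    return np.sqrt(r) * total / (s * n_draws)

# Tiny deterministic case: one observation in a 2 x 2 matrix with uniform marginals.
value = rademacher_weighted_class([(0, 0)], [0.5, 0.5], [0.5, 0.5], r=1, n_draws=5)
```

In the single-observation case the rescaled matrix has one entry of magnitude 2 regardless of the sign, so the estimate is exactly 2.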
2.1 Guarantees for Special Sampling Distributions

We begin by providing guarantees for an arbitrary, possibly unbounded, Lipschitz loss $\ell(x,y)$, but only
under sampling distributions which are either product distributions (i.e. $p(i,j) = p^r(i) p^c(j)$) or have
uniform marginals (i.e. $p^r$ and $p^c$ are uniform, but perhaps the rows and columns are not independent).
In Section 2.3 below, we will see why this severe restriction on $p$ is needed.



Theorem 1. For an $l$-Lipschitz loss $\ell$, fix any matrix $Y$, sample size $s$, and distribution $p$, such that $p$
is either a product distribution or has uniform marginals.
Let $\hat{X}_S = \arg\min\{\hat{L}_S(X) : X \in \mathcal{W}_r[p]\}$. Then, in expectation over the training sample $S$ drawn
i.i.d. from the distribution $p$,

\[ L_p(\hat{X}_S) \le \inf_{X \in \mathcal{W}_r[p]} L_p(X) + O\left( l \cdot \sqrt{\frac{rn \log(n)}{s}} \right). \tag{1} \]



Here and elsewhere we state learning guarantees in expectation for simplicity, but all guarantees can 
also be obtained with high probability. 

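Proof. We will show how to bound the expected Rademacher complexity $\mathbb{E}_S\left[\hat{\mathcal{R}}_S(\mathcal{W}_r[p])\right]$, from which
the desired result follows using standard arguments [13].

Following [10] by including the weights, using the duality between spectral norm $\|\cdot\|_{\mathrm{sp}}$ and trace-norm,
we compute:

\[ \mathbb{E}_S\left[\hat{\mathcal{R}}_S(\mathcal{W}_r[p])\right] = \frac{\sqrt{r}}{s} \, \mathbb{E}_{S,\sigma}\left[ \left\| \sum_{t=1}^{s} \sigma_t \frac{e_{i_t,j_t}}{\sqrt{p^r(i_t) p^c(j_t)}} \right\|_{\mathrm{sp}} \right] = \frac{\sqrt{r}}{s} \, \mathbb{E}_{S,\sigma}\left[ \left\| \sum_{t=1}^{s} Q_t \right\|_{\mathrm{sp}} \right], \]

where $e_{i,j} = e_i e_j^T$ and $Q_t = \sigma_t \frac{e_{i_t,j_t}}{\sqrt{p^r(i_t) p^c(j_t)}} \in \mathbb{R}^{n \times m}$. Since the $Q_t$'s are i.i.d. zero-mean matrices,
Theorem 6.1 of [14], combined with Remarks 6.4 and 6.5 there, establishes that

\[ \mathbb{E}_{S,\sigma}\left[ \left\| \sum_{t=1}^{s} Q_t \right\|_{\mathrm{sp}} \right] = O\left( \sigma \sqrt{\log(n)} + R \log(n) \right), \]

where $\|Q_t\|_{\mathrm{sp}} \le R$ (almost surely) and $\sigma^2 = \max\left\{ \left\|\sum_t \mathbb{E}[Q_t^T Q_t]\right\|_{\mathrm{sp}}, \left\|\sum_t \mathbb{E}[Q_t Q_t^T]\right\|_{\mathrm{sp}} \right\}$. Calculating
these (see Appendix A), we get

\[ R \le \sqrt{\frac{nm}{\min_{i,j}\{n p^r(i) \cdot m p^c(j)\}}} \]

and

\[ \sigma \le \sqrt{ s \cdot \max\left\{ \max_i \sum_j \frac{p(i,j)}{p^r(i) p^c(j)}, \; \max_j \sum_i \frac{p(i,j)}{p^r(i) p^c(j)} \right\} } \le \sqrt{\frac{sn}{\min_{i,j}\{n p^r(i) \cdot m p^c(j)\}}}. \]

If $p$ has uniform row- and column-marginals, then for all $i,j$, $n p^r(i) = m p^c(j) = 1$. This yields

\[ \mathbb{E}_S\left[\hat{\mathcal{R}}_S(\mathcal{W}_r[p])\right] \le O\left( \sqrt{\frac{rn \log(n)}{s}} \right), \]

as desired. (Here we assume $s > n \log(n)$, since otherwise we need only establish that excess error is
$O(l\sqrt{r})$, which holds trivially for any matrix in $\mathcal{W}_r[p]$.)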
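
The two quantities that drive the spectral-norm bound in this proof can be evaluated directly for any given joint distribution. The sketch below does so in plain Python; the function name `spectral_bound_parameters` is hypothetical, and it assumes every row and column marginal is strictly positive (otherwise R is unbounded, which is exactly the issue the proof goes on to address).

```python
import math

def spectral_bound_parameters(p, s):
    """Return (R, sigma_sq) for the matrices Q_t built from joint distribution p.

    p is an n x m list of lists of probabilities; s is the sample size.
    R bounds the spectral norm of each Q_t, and sigma_sq is the larger of the
    spectral norms of sum_t E[Q_t Q_t^T] and sum_t E[Q_t^T Q_t].
    """
    n, m = len(p), len(p[0])
    p_row = [sum(p[i]) for i in range(n)]
    p_col = [sum(p[i][j] for i in range(n)) for j in range(m)]
    # Each Q_t has a single non-zero entry of magnitude 1/sqrt(p_row[i] * p_col[j]).
    R = max(1.0 / math.sqrt(p_row[i] * p_col[j]) for i in range(n) for j in range(m))
    # Both expectations are diagonal, so their spectral norms are maximal diagonal entries.
    row_terms = [sum(p[i][j] / (p_row[i] * p_col[j]) for j in range(m)) for i in range(n)]
    col_terms = [sum(p[i][j] / (p_row[i] * p_col[j]) for i in range(n)) for j in range(m)]
    sigma_sq = s * max(max(row_terms), max(col_terms))
    return R, sigma_sq

# Uniform distribution on a 3 x 2 matrix with s = 10 samples.
uniform = [[1.0 / 6.0] * 2 for _ in range(3)]
R_uniform, sigma_sq_uniform = spectral_bound_parameters(uniform, s=10)
```

For the uniform distribution this gives R = sqrt(nm) and sigma squared equal to s times max(n, m), matching the uniform-marginals case of the proof.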

If p does not have uniform marginals, but instead is a product distribution, then the quantity R 
defined above is potentially unbounded, so we cannot apply the same simple argument. However, we can 
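consider the "p-truncated" class of matrices

\[ \mathcal{Z} = \left\{ Z(X) = \left( X_{ij} \, \mathbb{1}\left\{ p(i,j) \ge \frac{\log(n)}{s\sqrt{nm}} \right\} \right)_{ij} : X \in \mathcal{W}_r[p] \right\}. \]

By a similar calculation of the expected spectral norms, we can now bound $\mathbb{E}_S\left[\hat{\mathcal{R}}_S(\mathcal{Z})\right] \le O\left(\sqrt{\frac{rn \log(n)}{s}}\right)$.
Applying [13], this bounds $\left(L_p(Z(\hat{X}_S)) - \hat{L}_S(Z(\hat{X}_S))\right)$ (in expectation). Since $Z(\hat{X}_S)_{ij} \ne (\hat{X}_S)_{ij}$ only on
the extremely low-probability entries, we can also bound $\left(L_p(\hat{X}_S) - L_p(Z(\hat{X}_S))\right)$ and $\left(\hat{L}_S(Z(\hat{X}_S)) - \hat{L}_S(\hat{X}_S)\right)$.
Combining these steps, we can bound $\left(L_p(\hat{X}_S) - \hat{L}_S(\hat{X}_S)\right)$. We similarly bound $\hat{L}_S(X^*) - L_p(X^*)$, where
$X^* = \arg\min_{X \in \mathcal{W}_r[p]} L_p(X)$. Since $\hat{L}_S(\hat{X}_S) \le \hat{L}_S(X^*)$, this yields the desired bound on excess error.
The details are given in Appendix A. □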

Examining the proof of Theorem 1, we see that we can generalize the result by including distributions
$p$ with row- and column-marginals that are lower-bounded. More precisely, if $p$ satisfies $p^r(i) \ge \frac{1}{Cn}$,
$p^c(j) \ge \frac{1}{Cm}$ for all $i,j$, then the bound (1) holds, up to a factor of $C$. Note that this result does not
require an upper bound on the row- and column-marginals, only a lower bound, i.e. it only requires that
no marginals are too low. This is important to note since the examples where the unweighted trace-norm
fails under a non-uniform distribution are situations where some marginals are very high (but none are
too low) [11]. This suggests that the low-probability marginals could perhaps be "smoothed" to satisfy
a lower bound, without removing the advantages of the weighted trace-norm. We will exploit this in
Section 3 to give a guarantee that holds more generally for arbitrary $p$, when smoothing is applied.

2.2 Guarantees for bounded loss 

In Theorem 1, we showed a strong bound on excess error, but only for a restricted class of distributions $p$.
We now show that if the loss function $\ell$ is bounded, then we can give a non-trivial, but weaker, learning
guarantee that holds uniformly over all distributions $p$. Since we are in any case discussing Lipschitz loss
functions, requiring that the loss function be bounded essentially amounts to requiring that the entries
of the matrices involved be bounded. That is, we can view this as a guarantee on learning matrices with
bounded entries. In Section 2.3 below, we will show that this boundedness assumption is unavoidable if
we want to give a guarantee that holds for arbitrary $p$.

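Theorem 2. For an $l$-Lipschitz loss $\ell$ bounded by $b$, fix any matrix $Y$, sample size $s$, and any distribution
$p$. Let $\hat{X}_S = \arg\min\{\hat{L}_S(X) : X \in \mathcal{W}_r[p]\}$ for $r \ge 1$. Then, in expectation over the training sample $S$
drawn i.i.d. from the distribution $p$,

\[ L_p(\hat{X}_S) \le \inf_{X \in \mathcal{W}_r[p]} L_p(X) + O\left( (l + b) \cdot \sqrt[3]{\frac{rn \log(n)}{s}} \right). \tag{2} \]

The proof is provided in Appendix A, and is again based on analyzing the expected Rademacher
complexity, $\mathbb{E}_S\left[\hat{\mathcal{R}}_S(\ell \circ \mathcal{W}_r[p])\right] \le O\left( (l + b) \cdot \sqrt[3]{\frac{rn \log(n)}{s}} \right)$.
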
2.3 Problems with the standard weighting 

In the previous Sections, we showed that for distributions $p$ that are either product distributions or have
uniform marginals, we can prove a square-root bound on excess error, as shown in (1). For arbitrary $p$,
the only learning guarantee we obtain is a cube-root bound given in (2), for the special case of bounded
loss. We would like to know whether the square-root bound might hold uniformly over all distributions
$p$, and if not, whether the cube-root bound is the strongest result that we can give in this case for the
bounded-loss setting, and whether any bound will hold uniformly over all $p$ in the unbounded-loss setting.

The examples below demonstrate that we cannot improve the results of Theorems 1 and 2 (up to
log factors), by constructing degenerate examples using non-product distributions $p$ with non-uniform
marginals. Specifically, in Example 1, we show that in the special case of bounded loss, the cube-root
bound in Theorem 2 is the best possible bound (up to the log factor) that will hold for all $p$, by giving a construction
for arbitrary $n = m$ and arbitrary $s \le nm$, such that with 1-bounded loss, excess error is $\Omega\left(\sqrt[3]{n/s}\right)$. In
Example 2, we show that with unbounded (Lipschitz) loss, we cannot bound excess error better than a
constant bound, by giving a construction for arbitrary $n = m$ and arbitrary $s \le nm$ in the unbounded-loss
regime, where excess error is $\Omega(1)$. For both examples we fix $r = 1$. We note that both examples
can be modified to fit the transductive setting, demonstrating that smoothing is also necessary in the
transductive setting.

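Example 1. Let $\ell(x,y) = \min\{1, |x - y|\} \le 1$, let $a = (2s/n)^{2/3} < n$, and let matrix $Y$ and block-wise
constant distribution $p$ be given by

\[ Y = \begin{pmatrix} A & 0_{a \times \frac{n}{2}} \\ 0_{(n-a) \times \frac{n}{2}} & 0_{(n-a) \times \frac{n}{2}} \end{pmatrix}, \qquad (p(i,j)) = \begin{pmatrix} \frac{1}{s} \cdot 1_{a \times \frac{n}{2}} & 0_{a \times \frac{n}{2}} \\ 0_{(n-a) \times \frac{n}{2}} & \frac{1 - \frac{an}{2s}}{(n-a) \cdot \frac{n}{2}} \cdot 1_{(n-a) \times \frac{n}{2}} \end{pmatrix}, \]

where $A \in \{\pm 1\}^{a \times \frac{n}{2}}$ is any sign matrix. Clearly, $\|Y\|_{\mathrm{tr}(p^r,p^c)} \le 1$, and so $\inf_{X \in \mathcal{W}_r[p]} L_p(X) = 0$. Now
suppose we draw a sample $S$ of size $s$ from the matrix $Y$, according to the distribution $p$. We will show
an ERM $\hat{Y}$ such that in expectation over $S$, $L_p(\hat{Y}) \ge \frac{1}{8} \sqrt[3]{n/s}$.

Consider $Y^S$ where $Y^S_{ij} = Y_{ij} \mathbb{1}\{ij \in S\}$, and note that $\|Y^S\|_{\mathrm{tr}(p^r,p^c)} \le 1$. Since $\hat{L}_S(Y^S) = 0$, it is clearly
an ERM. We also have $L_p(Y^S) = \frac{N}{s}$, where $N$ is the number of $\pm 1$'s in $Y$ which are not observed in the
sample. Since $\mathbb{E}[N] \ge \frac{an}{8}$, we see that $\mathbb{E}\left[L_p(Y^S)\right] \ge \frac{an}{8s} \ge \frac{1}{8} \sqrt[3]{n/s}$.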

Example 2. Let $\ell(x,y) = |x - y|$. Let $Y = 0_{n \times n}$; trivially, $Y \in \mathcal{W}_r[p]$. Let $p(1,1) = \frac{1}{s}$, and
$p(i,1) = p(1,j) = 0$ for all $i,j > 1$, yielding $p^r(1) = p^c(1) = \frac{1}{s}$. (The other entries of $p$ may be defined
arbitrarily.) We will show an ERM $\hat{Y}$ such that, in expectation over $S$, $L_p(\hat{Y}) \ge 0.25$. Let $A$ be the
matrix with $A_{11} = s$ and zeros elsewhere, and note that $\|A\|_{\mathrm{tr}(p^r,p^c)} = 1$. With probability $\ge 0.25$, entry
$(1,1)$ will not appear in $S$, in which case $\hat{Y} = A$ is an ERM, with $L_p(\hat{Y}) = 1$.
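
The probability claim in Example 2 can be checked numerically. The sketch below assumes the construction places probability 1/s on entry (1,1), so that the entry is missed by all s i.i.d. draws with probability (1 - 1/s)^s; the helper name `prob_entry_unseen` is hypothetical.

```python
def prob_entry_unseen(s):
    """Probability that an entry with sampling probability 1/s is absent
    from s i.i.d. draws (with replacement)."""
    return (1.0 - 1.0 / s) ** s

# The probability starts at 0.25 for s = 2 and increases towards 1/e.
probs = {s: prob_entry_unseen(s) for s in (2, 10, 100, 10000)}

# When the entry is unseen, the ERM with a single entry of size s incurs
# expected loss p(1,1) * s = 1, so the expected loss is at least this probability.
expected_loss_lower_bounds = {s: q * 1.0 for s, q in probs.items()}
```

Because the probability is at least 0.25 for every s of at least 2, the expected loss of this ERM does not shrink as the sample grows.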

The following table summarizes the learning guarantees that can be established for the (standard)
weighted trace-norm. As we saw, these guarantees are tight up to log-factors.

                          1-Lipschitz, 1-bounded loss      1-Lipschitz, unbounded loss
p = product               $\sqrt{rn\log(n)/s}$             $\sqrt{rn\log(n)/s}$
$p^r, p^c$ = uniform      $\sqrt{rn\log(n)/s}$             $\sqrt{rn\log(n)/s}$
p arbitrary               $\sqrt[3]{rn\log(n)/s}$          1

3 Smoothing the weighted trace norm 

Considering Theorem 1 and the degenerate examples in Section 2.3, it seems that in order to be able to
generalize for non-product distributions, we need to enforce some sort of uniformity on the weights. The
Rademacher complexity computations in the proof of Theorem 1 show that the problem lies not with
large entries in the vectors $p^r$ and $p^c$ (i.e. if $p^r$ and/or $p^c$ are "spiky"), but with the small entries in these
vectors. This suggests the possibility of "smoothing" any overly low row- or column-marginals, in order
to improve learning guarantees.

In Section 3.1, we present such a smoothing, and provide guarantees for learning with a smoothed
weighted trace-norm. The result suggests that there is no strong negative consequence to smoothing, but
there might be a large advantage, if confronted with situations as in Examples 1 and 2. In Section 3.2,
we check the smoothing correction to the weighted trace-norm on real data, and observe that indeed it
can also be beneficial in practice.

3.1 Learning guarantee for arbitrary distributions 

Fix a distribution $p$ and a constant $\alpha \in (0,1)$, and let $\tilde{p}$ denote the smoothed marginals:

\[ \tilde{p}^r(i) = \alpha \cdot p^r(i) + (1 - \alpha) \cdot \tfrac{1}{n}, \qquad \tilde{p}^c(j) = \alpha \cdot p^c(j) + (1 - \alpha) \cdot \tfrac{1}{m}. \tag{3} \]

In the theoretical results below, we use $\alpha = \tfrac{1}{2}$, but up to a constant factor, the same results hold for any
fixed choice of $\alpha \in (0,1)$.
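
The smoothing in (3) is a convex combination of each marginal with the uniform distribution. A minimal sketch follows; the function name `smooth_marginals` and the default mixing weight of 0.5 are assumptions for illustration.

```python
def smooth_marginals(p_row, p_col, alpha=0.5):
    """Mix each marginal with the uniform distribution, as in (3).

    alpha = 1 leaves the marginals unchanged; smaller alpha pulls them
    towards uniform, so every smoothed row marginal is at least (1 - alpha)/n
    and every smoothed column marginal is at least (1 - alpha)/m.
    """
    n, m = len(p_row), len(p_col)
    smoothed_row = [alpha * p + (1.0 - alpha) / n for p in p_row]
    smoothed_col = [alpha * q + (1.0 - alpha) / m for q in p_col]
    return smoothed_row, smoothed_col

# A degenerate row marginal concentrated on a single row.
row, col = smooth_marginals([1.0, 0.0, 0.0, 0.0], [0.5, 0.5], alpha=0.5)
```

The smoothed vectors still sum to one, and the rows that originally had zero mass now have mass (1 - alpha)/n, which is the lower bound the analysis needs.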

Theorem 3. For an $l$-Lipschitz loss $\ell$, fix any matrix $Y$, sample size $s$, and any distribution $p$. Let
$\hat{X}_S = \arg\min\{\hat{L}_S(X) : X \in \mathcal{W}_r[\tilde{p}]\}$. Then, in expectation over the training sample $S$ drawn i.i.d.
from the distribution $p$,

\[ L_p(\hat{X}_S) \le \inf_{X \in \mathcal{W}_r[\tilde{p}]} L_p(X) + O\left( l \cdot \sqrt{\frac{rn \log(n)}{s}} \right). \tag{4} \]

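Proof. We bound $\mathbb{E}_{S \sim p}\left[\hat{\mathcal{R}}_S(\mathcal{W}_r[\tilde{p}])\right] \le O\left(\sqrt{\frac{rn \log(n)}{s}}\right)$, and then apply [13]. The proof of this Rademacher
bound is essentially identical to the proof in Theorem 1, with the modified definition of $Q_t = \sigma_t \frac{e_{i_t,j_t}}{\sqrt{\tilde{p}^r(i_t) \tilde{p}^c(j_t)}}$.
Then $\|Q_t\|_{\mathrm{sp}} \le \max_{ij} \frac{1}{\sqrt{\tilde{p}^r(i) \tilde{p}^c(j)}} \le 2\sqrt{nm} = R$, and

\[ \left\| \mathbb{E}\left[\sum_{t=1}^{s} Q_t Q_t^T\right] \right\|_{\mathrm{sp}} = s \cdot \max_i \sum_j \frac{p(i,j)}{\tilde{p}^r(i) \tilde{p}^c(j)} \le s \cdot \max_i \sum_j \frac{p(i,j)}{\frac{1}{2} p^r(i) \cdot \frac{1}{2m}} = 4sm. \]

Similarly, $\left\| \mathbb{E}\left[\sum_{t=1}^{s} Q_t^T Q_t\right] \right\|_{\mathrm{sp}} \le 4sn$. Setting $\sigma = \sqrt{4sn}$ and applying [14], we obtain the result. □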
Moving from Theorem 1 to Theorem 3, we are competing with a different class of matrices:
$\inf_{X \in \mathcal{W}_r[p]} L_p(X)$ is replaced by $\inf_{X \in \mathcal{W}_r[\tilde{p}]} L_p(X)$.
In most applications we can think of, this change is not significant. For example, we consider the low-rank
matrix reconstruction problem, where the trace-norm bound is used as a surrogate for rank. In
order for the (squared) weighted trace-norm to be a lower bound on the rank, we would need to assume
$\left\| \mathrm{diag}(p^r)^{1/2} X^* \mathrm{diag}(p^c)^{1/2} \right\|_F^2 \le 1$ [10]. If we also assume that $\left\|(X^*)_{(i)}\right\|_2^2 \le m$ and $\left\|(X^*)^{(j)}\right\|_2^2 \le n$ for
all rows $i$ and columns $j$ (i.e. the row and column magnitudes are not "spiky"), then $X^* \in \mathcal{W}_r[\tilde{p}]$.
Note that this condition is much weaker than placing a spikiness condition on $X^*$ itself, e.g. requiring

3.2 Results on Netflix and MovieLens Datasets 

We evaluated different models on two publicly-available collaborative filtering datasets: Netflix [15] and
MovieLens [16]. The Netflix dataset consists of 100,480,507 ratings from 480,189 users on 17,770 movies.
Netflix also provides a qualification set containing 1,408,395 ratings, but due to the sampling scheme,
ratings from users with few ratings are overrepresented relative to the training set. To avoid dealing with 
different training and test distributions, we also created our own validation and test sets, each containing 
100,000 ratings set aside from the training set. The MovieLens dataset contains 10,000,054 ratings from 
71,567 users and 10,681 movies. We again set aside test and validation sets of 100,000 ratings. Ratings 
were normalized to be zero-mean. 

When dealing with large datasets the most practical way to fit trace-norm regularized models is via 
stochastic gradient descent [TTl [2j |TTJ . For computational reasons, however, we consider rank-truncated 
trace-norm minimization, by optimizing within the restricted class $\{X : X \in \mathcal{W}_r[\tilde{p}], \mathrm{rank}(X) \le k\}$ for
$k = 30$ and $k = 100$, and for various values of the smoothing parameter $\alpha$ (as in (3)). For each value of $\alpha$
and $k$, the regularization parameter was chosen by cross-validation.

The following table shows root mean squared error (RMSE) for the experiments. For both $k = 30$
and $k = 100$ the weighted trace-norm with smoothing significantly outperforms the weighted trace-norm
without smoothing ($\alpha = 1$), even on the differently-sampled Netflix qualification set. We also note
that the proposed weighted trace-norm with smoothing outperforms max-norm regularization [18], and
compares favorably with the "geometric" smoothing used by [11] as a heuristic, without theoretical or
conceptual justification. A moderate value of $\alpha = 0.9$ seems consistently good.









          Netflix                                         MovieLens
α         k     Test     Qual     k     Test     Qual     k     Test     k     Test
1         30    0.7604   0.9107   100   0.7404   0.9078   30    0.7852   100   0.7821
0.9       30    0.7589   0.9096   100   0.7391   0.9068   30    0.7831   100   0.7798
0.5       30    0.7601   0.9173   100   0.7419   0.9161   30    0.7836   100   0.7815
0.3       30    0.7712   0.9198   100   0.7528   0.9207   30    0.7864   100   0.7871
          30    0.7887   0.9249   100   0.7659   0.9236   30    0.7997   100   0.7987

4 The empirically-weighted trace norm

In practice, the sampling distribution $p$ is not known exactly; it can only be estimated via the locations
of the entries which are observed in the sample. Defining the empirical marginals

\[ \hat{p}^r(i) = \frac{\#\{t : i_t = i\}}{s}, \qquad \hat{p}^c(j) = \frac{\#\{t : j_t = j\}}{s}, \]

we would like to give a learning guarantee when $\hat{X}_S$ is estimated via regularization on the $\hat{p}$-weighted
trace-norm, rather than the $p$-weighted trace-norm.
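
The empirical marginals are simply the observed row and column frequencies in the sample. The following sketch computes them with the standard library; the function name `empirical_marginals` and the zero-based indexing are assumptions for illustration.

```python
from collections import Counter

def empirical_marginals(sample, n, m):
    """Row and column frequencies of a sample of (i, j) index pairs.

    sample is a list of zero-based (row, column) pairs, possibly with repeats;
    n and m are the matrix dimensions.
    """
    s = len(sample)
    row_counts = Counter(i for i, _ in sample)
    col_counts = Counter(j for _, j in sample)
    # Counter returns 0 for rows or columns that were never observed.
    p_row_hat = [row_counts[i] / s for i in range(n)]
    p_col_hat = [col_counts[j] / s for j in range(m)]
    return p_row_hat, p_col_hat

# Four observations from a 3 x 2 matrix; entry (0, 1) is observed twice.
p_row_hat, p_col_hat = empirical_marginals([(0, 0), (0, 1), (2, 1), (0, 1)], n=3, m=2)
```

Rows that are never observed receive an empirical marginal of exactly zero, which is one reason the smoothing of Section 3 matters when these frequencies are used as weights.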



In Section 4.1 we give bounds on excess error when learning with smoothed empirical marginals,
which show that there is no theoretical disadvantage as compared to learning with the smoothed true
marginals. In fact, we provide evidence that suggests there might even be an advantage to using the
empirical marginals. To this end, in Section 4.2 we introduce the transductive learning setting, and give
a result based on the empirical marginals which implies a sample complexity bound that is better by
a factor of $\log^{1/2}(n)$. In Section 4.3 we show that in low-rank matrix reconstruction simulations, using
empirical marginals indeed yields better reconstructions.

4.1 Guarantee for the standard (inductive) setting 

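We first show that when learning with the smoothed empirical marginals, defined as

\[ \check{p}^r(i) = \tfrac{1}{2}\left(\hat{p}^r(i) + \tfrac{1}{n}\right), \qquad \check{p}^c(j) = \tfrac{1}{2}\left(\hat{p}^c(j) + \tfrac{1}{m}\right), \]

we can obtain the same guarantee as for learning with the smoothed (true) marginals, given by $\tilde{p}$.
Theorem 4. For an $l$-Lipschitz loss $\ell$, fix any matrix $Y$, sample size $s$, and any distribution $p$. Let
$\hat{X}_S = \arg\min\{\hat{L}_S(X) : X \in \mathcal{W}_r[\check{p}]\}$. Then, in expectation over the training sample $S$ drawn i.i.d.
from the distribution $p$,

\[ L_p(\hat{X}_S) \le \inf_{X \in \mathcal{W}_r[\tilde{p}]} L_p(X) + O\left( l \cdot \sqrt{\frac{r \max\{n,m\} \log(n+m)}{s}} \right). \tag{5} \]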

Note that although we regularize using the (smoothed) empirically-weighted trace-norm, we still 
compare ourselves to the best possible matrix in the class defined by the (smoothed) true marginals. 

The proof of the Theorem (given in Appendix A) uses Theorem 3 and involves showing that with a
sample of size $s = \Omega(n \log(n))$, which is required for all Theorems so far to be meaningful, the true and
empirical marginals are the same up to a constant factor. For this to be the case, such a sample size is
even necessary. In fact, the $\log(n)$ factor in our analysis (e.g. in the proof of Theorem 1) arises from the
bound on the expected spectral norm of a matrix, which, for a diagonal matrix, is just a bound on the 
deviation of empirical frequencies. Might it be possible, then, to avoid this logarithmic factor by using 
the empirical marginals? Although we could not establish such a result in the inductive setting, we now 
turn to the transductive setting, where we could indeed obtain a better guarantee. 

4.2 Guarantee for the transductive setting 

In the transductive model, we fix a set $\bar{S} \subset [n] \times [m]$ of size $2s$, and then randomly split $\bar{S}$ into a training
set $S$ and a test set $T$ of equal size $s$. The goal is to obtain a good estimator for the entries in $T$ based
on the values of the entries in $S$, as well as the locations (indexes) of all elements of $\bar{S}$. We then use the
(smoothed or unsmoothed) empirical marginals of $\bar{S}$, for the weighted trace-norm.
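
The transductive sampling procedure can be sketched as a uniformly random split of the fixed set of 2s index pairs into equal halves. The function name `transductive_split` and the seeded generator are assumptions made so the example is reproducible.

```python
import random

def transductive_split(entries, seed=0):
    """Split a fixed collection of 2s index pairs uniformly at random
    into a training half and a test half of equal size."""
    entries = list(entries)
    if len(entries) % 2 != 0:
        raise ValueError("the fixed set must contain an even number of entries")
    rng = random.Random(seed)
    rng.shuffle(entries)  # a uniformly random permutation of the fixed set
    half = len(entries) // 2
    return entries[:half], entries[half:]

# Six fixed locations; their indexes (but not the test values) are known in advance.
fixed = [(0, 0), (0, 1), (1, 2), (2, 0), (2, 2), (3, 1)]
train, test = transductive_split(fixed)
```

The weights for the trace-norm are then computed from the locations in the whole fixed set, so they do not depend on how the split falls.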

We now show that, for bounded loss, there may be a benefit to weighting with the smoothed empirical
marginals: the sample size requirement can be lowered to $s = O\left(rn \log^{1/2}(n)\right)$.



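Theorem 5. For an $l$-Lipschitz loss $\ell$ bounded by $b$, fix any matrix $Y$ and sample size $s$. Let $\bar{S} \subset [n] \times [m]$
be a fixed subset of size $2s$, split uniformly at random into training and test sets $S$ and $T$, each of size $s$.
Let $\check{p}$ denote the smoothed empirical marginals of $\bar{S}$. Let $\hat{X}_S = \arg\min\{\hat{L}_S(X) : X \in \mathcal{W}_r[\check{p}]\}$. Then
in expectation over the splitting of $\bar{S}$ into $S$ and $T$,

\[ \hat{L}_T(\hat{X}_S) \le \inf_{X \in \mathcal{W}_r[\check{p}]} \hat{L}_T(X) + O\left( l \cdot \sqrt{\frac{rn \log^{1/2}(n)}{s}} + \frac{b}{\sqrt{s}} \right). \tag{6} \]
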
This result (proved in Appendix B) is stated in the transductive setting, with a somewhat different
sampling procedure and evaluation criteria, but we believe the main difference is in the use of the empirical
weights. Although it is usually straightforward to convert a transductive guarantee to an inductive one,
the situation here is more complicated, since the hypothesis class depends on the weighting, and hence
on the sample $S$. Nevertheless, we believe such a conversion might be possible, establishing a similar
guarantee for learning with the (smoothed) empirically weighted trace-norm also in the inductive setting.
Furthermore, by using the fact that a sample of size $s = O(n \log(n))$ is sufficient for the empirical
marginals to be close to the true marginals, it might be possible to obtain a learning guarantee for the
true (non-empirical) weighting with a sample of size $s = O\left(n\left(r \log^{1/2}(n) + \log(n)\right)\right)$.

Theorem 5 above can be viewed as a transductive analog to Theorem 3 (where weights are based
on the combined sample $\bar{S}$). In Appendix B we state and prove transductive analogs also to Theorem 1
(for the case where smoothing is not needed) and Theorem 2 (giving a cube-root rate). As mentioned
in Section 2.3, our lower bound examples can also be stated in the transductive setting, and thus all our
guarantees and lower bounds can also be obtained in this setting.

4.3 Simulations with empirical weights 

In order to numerically investigate the possible advantage of empirical weighting, we performed
simulations on low-rank matrix reconstruction under uniform sampling with the unweighted, and the smoothed
empirically weighted, trace-norms. We choose to work with uniform sampling in order to emphasize the
benefit of empirical weights, even in situations where one might not consider using any weights at all.
In all the experiments, we attempt to reconstruct a possibly noisy, random rank-2 "signal" matrix $M$
with singular values $\frac{1}{\sqrt{2}}(n, n, 0, \ldots, 0)$, ensuring $\|M\|_F = n$, measuring error using the squared loss.¹

Simulations were performed using Matlab, with code adapted from the SoftImpute code developed
by [19]. We performed two types of simulations:

Sample complexity comparison in the noiseless setting: We define $Y = M$, and compute
$\hat{X}_S = \arg\min\{\|X\| : \hat{L}_S(X) = 0\}$, where $\|X\| = \|X\|_{\mathrm{tr}}$ or $\|X\| = \|X\|_{\mathrm{tr}(\check{p}^r,\check{p}^c)}$, as appropriate. In Figure 1(a), we
plot the average number of samples per row needed to get average squared error (over 100 repetitions)
of at most 0.1, with both uniform weighting and empirical weighting.

Excess error comparison in the noiseless and noisy settings: We define $Y = M + \nu N$, where noise
$N$ has i.i.d. standard normal entries. We compute $\hat{X}_S = \arg\min\{\|X\| : \hat{L}_S(X) \le \nu^2\}$. In Figure 1(b),
we plot the resulting average squared error (over 100 repetitions) over a range of sample sizes $s$ and noise
levels $\nu$, with both uniform weighting and empirical weighting.

The results from both experiments show a significant benefit to using the empirical marginals.

[Figure 1 plots. Panel (a) x-axis: matrix dimension n. Panel (b) x-axis: sample size s; curves for ν = 0.0, 0.2, 0.4, each with true p and empirical p.]




Figure 1: (a) Left: Sample size needed to obtain avg. error 0.1, with respect to n. (b) Right: Excess error 
level over a range of sample sizes, for fixed n = 200. (Axes are on a logarithmic scale.) 



1 Although the squared loss is Lipschitz in a bounded domain, it is probably possible to improve all our results
(removing the square root) in the special case of the squared loss, possibly with the additional assumption of
i.i.d. noise, as in [5].
5 Discussion

In this paper, we prove learning guarantees for the weighted trace-norm by analyzing expected Rademacher
complexities. We show that weighting with smoothed marginals eliminates degenerate scenarios that can
arise in the case of a non-product sampling distribution, and demonstrate in experiments on the Netflix
and MovieLens datasets that this correction can be useful in applied settings. We also give results for
empirically-weighted trace-norm regularization, and see indications that using the empirical distribution
may be better than using the true distribution, even if it is available.
A Proofs for the i.i.d. sampling setting
A.1 Proof of Theorem 1

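We first fill in the details for the Rademacher bound in the case that $p$ has uniform row- and
column-marginals. Define

\[ Q_t = \sigma_t \frac{e_{i_t,j_t}}{\sqrt{p^r(i_t) p^c(j_t)}}. \]

We need to calculate $R$ and $\sigma^2$ such that $\|Q_t\|_{\mathrm{sp}} \le R$ (almost surely) and

\[ \sigma^2 = \max\left\{ \left\| \sum_t \mathbb{E}[Q_t^T Q_t] \right\|_{\mathrm{sp}}, \left\| \sum_t \mathbb{E}[Q_t Q_t^T] \right\|_{\mathrm{sp}} \right\}. \]

For each $t$, $Q_t$ is just a matrix with a single non-zero entry of magnitude $\frac{1}{\sqrt{p^r(i) p^c(j)}}$, for some $i,j$, and
so $\|Q_t\|_{\mathrm{sp}} \le \max_{ij} \frac{1}{\sqrt{p^r(i) p^c(j)}} = R$.

The matrix $Q_t Q_t^T \in \mathbb{R}^{n \times n}$ is equal to $\frac{e_i e_i^T}{p^r(i) p^c(j)}$ with probability $p(i,j)$. Hence $\mathbb{E}[Q_t Q_t^T]$ is a diagonal
matrix with entries $\sum_j \frac{p(i,j)}{p^r(i) p^c(j)}$. Similar arguments apply to $Q_t^T Q_t$. Multiplying by $s$, and recalling the
spectral norm of a diagonal matrix is simply the maximal magnitude element, we have:

\[ \sigma^2 = s \cdot \max\left\{ \max_i \sum_j \frac{p(i,j)}{p^r(i) p^c(j)}, \; \max_j \sum_i \frac{p(i,j)}{p^r(i) p^c(j)} \right\}. \]

This completes the proof for the case that $p$ has uniform row- and column-marginals.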

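Next we turn to the case that $p$ is a product distribution, $p = p^r \times p^c$ (with possibly non-uniform
marginals). For any $X \in \mathcal{W}_r[p]$, define

\[ Z(X) = \left( X_{ij} \, \mathbb{1}\left\{ p(i,j) \ge \frac{\log(n)}{s\sqrt{nm}} \right\} \right)_{ij}. \]

Let $\mathcal{Z} = \{Z(X) : X \in \mathcal{W}_r[p]\}$.

We can then follow the proof of the bound in the uniform-marginals case, with a modified definition
of $Q_t$:

\[ Q_t = \sigma_t \cdot \frac{e_{i_t,j_t} \, \mathbb{1}\left\{ p(i_t,j_t) \ge \frac{\log(n)}{s\sqrt{nm}} \right\}}{\sqrt{p^r(i_t) p^c(j_t)}}. \]

Proceeding as in the proof for Theorem 1, we obtain $R \le \sqrt{\frac{s\sqrt{nm}}{\log(n)}}$ and $\sigma^2 \le sn$, so that

\[ \mathbb{E}_S\left[\hat{\mathcal{R}}_S(\mathcal{Z})\right] \le O\left( \sqrt{\frac{rn \log(n)}{s}} \right). \]

Therefore, by [13],

\[ \mathbb{E}\left[ \sup_{X \in \mathcal{W}_r[p]} L_p(Z(X)) - \hat{L}_S(Z(X)) \right] \le O\left( l \sqrt{\frac{rn \log(n)}{s}} \right), \]

\[ \mathbb{E}\left[ \sup_{X \in \mathcal{W}_r[p]} \hat{L}_S(Z(X)) - L_p(Z(X)) \right] \le O\left( l \sqrt{\frac{rn \log(n)}{s}} \right). \]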
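Next, let $I = \left( \sqrt{p(i,j)} \, \mathbb{1}\left\{ p(i,j) < \frac{\log(n)}{s\sqrt{nm}} \right\} \right)_{ij}$. For any matrix $M$, define

\[ \|M\|_{F(p^r,p^c)} = \left\| \mathrm{diag}(p^r)^{1/2} M \, \mathrm{diag}(p^c)^{1/2} \right\|_F. \]

Now take any $M$ with $\|M\|_{F(p^r,p^c)} \le 1$. Let $M' = \mathrm{diag}(p^r)^{1/2} M \, \mathrm{diag}(p^c)^{1/2}$, then $\|M'\|_F \le 1$. We have

\[ \sum_{ij : \, p(i,j) < \frac{\log(n)}{s\sqrt{nm}}} p(i,j) |M_{ij}| = \sum_{ij} I_{ij} |M'_{ij}| \le \|I\|_F \cdot \|M'\|_F \le \sqrt{ \sum_{ij : \, p(i,j) < \frac{\log(n)}{s\sqrt{nm}}} p(i,j) } \le \sqrt{ nm \cdot \frac{\log(n)}{s\sqrt{nm}} } = \sqrt{\frac{\sqrt{nm} \log(n)}{s}}. \]

Since $\|M\|_F \le \|M\|_{\mathrm{tr}}$ for any matrix $M$, we then have, for any $X \in \mathcal{W}_r[p]$, $\|X\|_{F(p^r,p^c)} \le \|X\|_{\mathrm{tr}(p^r,p^c)} \le \sqrt{r}$, and so

\[ \left| L_p(X) - L_p(Z(X)) \right| = \left| \mathbb{E}_{ij \sim p}\left( \ell(X_{ij}, Y_{ij}) - \ell(Z(X)_{ij}, Y_{ij}) \right) \right| \le l \cdot \sum_{ij : \, p(i,j) < \frac{\log(n)}{s\sqrt{nm}}} p(i,j) |X_{ij}| \le \sqrt{\frac{l^2 r \sqrt{nm} \log(n)}{s}}. \]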

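And, fixing some $X^* \in \mathcal{W}_r[p]$ such that $L_p(X^*) = \inf_{X\in\mathcal{W}_r[p]} L_p(X)$, and writing $\mathbb{I}_t = \mathbb{I}\left\{p(i_t,j_t) < \frac{\log(n)}{s\sqrt{nm}}\right\}$ for the indicator that the $t$-th sampled entry is zeroed out by $Z$,

$$\begin{aligned}
&\mathbb{E}\left[\sup_{X\in\mathcal{W}_r[p]} \Big(\hat{L}_S(Z(X)) - \hat{L}_S(X)\Big)\right] + \mathbb{E}\left[\hat{L}_S(X^*) - \hat{L}_S(Z(X^*))\right] \\
&\quad= \mathbb{E}\left[\sup_{X\in\mathcal{W}_r[p]} \frac{1}{s}\sum_{t=1}^s \mathbb{I}_t\Big(\ell(0,Y_{i_tj_t}) - \ell(X_{i_tj_t},Y_{i_tj_t})\Big)\right] + \mathbb{E}\left[\frac{1}{s}\sum_{t=1}^s \mathbb{I}_t\Big(\ell(X^*_{i_tj_t},Y_{i_tj_t}) - \ell(0,Y_{i_tj_t})\Big)\right] \\
&\quad= \mathbb{E}\left[\sup_{X\in\mathcal{W}_r[p]} \frac{1}{s}\sum_{t=1}^s \mathbb{I}_t\Big(\ell(X^*_{i_tj_t},Y_{i_tj_t}) - \ell(X_{i_tj_t},Y_{i_tj_t})\Big)\right] \\
&\quad\le l\cdot\mathbb{E}\left[\frac{1}{s}\sum_{t=1}^s \mathbb{I}_t\,|X^*_{i_tj_t}|\right] = l\cdot\mathbb{E}\Big[\mathbb{I}_1\,|X^*_{i_1j_1}|\Big] = l\cdot\sum_{ij:\,p(i,j) < \frac{\log(n)}{s\sqrt{nm}}} p(i,j)\,|X^*_{ij}| \le \sqrt{\frac{l^2 r\sqrt{nm}\,\log(n)}{s}}.
\end{aligned}$$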
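The truncation step can be checked numerically. The sketch below, under the assumption that $Z(X)$ zeroes every entry with $p(i,j) < \log(n)/(s\sqrt{nm})$, draws a random product distribution, normalises $X$ so that its weighted Frobenius norm is 1, and compares the probability-weighted mass of the discarded entries with $\sqrt{\sqrt{nm}\log(n)/s}$. The helper name `truncate` and all numerical values are hypothetical.

```python
import numpy as np

def truncate(X, p, s):
    """Z(X): zero out entries whose probability is below log(n) / (s * sqrt(n m))."""
    n, m = p.shape
    tau = np.log(n) / (s * np.sqrt(n * m))
    return np.where(p >= tau, X, 0.0), tau

rng = np.random.default_rng(0)
n, m, s = 8, 5, 20
pr = rng.dirichlet(np.ones(n))
pc = rng.dirichlet(np.ones(m))
p = np.outer(pr, pc)                       # product distribution p = p^r x p^c
X = rng.standard_normal((n, m))
X /= np.linalg.norm(np.sqrt(p) * X)        # now ||X||_{F(p^r,p^c)} = 1
Z, tau = truncate(X, p, s)
lost_mass = (p * np.abs(X - Z)).sum()      # sum of p(i,j)|X_ij| over truncated entries
bound = np.sqrt(np.sqrt(n * m) * np.log(n) / s)
```

By the Cauchy-Schwarz step in the proof, `lost_mass` cannot exceed `bound` for any choice of the random seed.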



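Then writing

$$\begin{aligned}
L_p(\hat{X}_S) - L_p(X^*) ={}& \big(L_p(\hat{X}_S) - L_p(Z(\hat{X}_S))\big) + \big(L_p(Z(\hat{X}_S)) - \hat{L}_S(Z(\hat{X}_S))\big) + \big(\hat{L}_S(Z(\hat{X}_S)) - \hat{L}_S(\hat{X}_S)\big) \\
&+ \big(\hat{L}_S(\hat{X}_S) - \hat{L}_S(X^*)\big) + \big(\hat{L}_S(X^*) - \hat{L}_S(Z(X^*))\big) + \big(\hat{L}_S(Z(X^*)) - L_p(Z(X^*))\big) + \big(L_p(Z(X^*)) - L_p(X^*)\big),
\end{aligned}$$

we obtain

$$\mathbb{E}\left[L_p(\hat{X}_S) - L_p(X^*)\right] \le O\left(\sqrt{\frac{l^2 rn\log(n)}{s}}\right).$$

A.2 Proof of Theorem 2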

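Assume $\ell$ is $l$-Lipschitz and $b$-bounded, and $r \ge 1$. We will show that (for any $p$)

$$\mathbb{E}_S\left[\hat{\mathcal{R}}_S(\ell\circ\mathcal{W}_r[p])\right] = O\left((l+b)\cdot\sqrt[3]{\frac{rn\log(n)}{s}}\right).$$

Given a sample $S$, define

$$T_S^0 = \left\{t : p^r(i_t) \text{ or } p^c(j_t) < \sqrt[3]{\frac{l^2 r\log(n)}{b^2 s n^2}}\right\},\qquad T_S^1 = \{1,\dots,s\}\setminus T_S^0.$$

We have

$$\begin{aligned}
\hat{\mathcal{R}}_S(\ell\circ\mathcal{W}_r[p]) &= \mathbb{E}_{\sigma\sim\{\pm1\}^s}\left[\sup_{X\in\mathcal{W}_r[p]} \frac{1}{s}\sum_{t=1}^s \sigma_t\,\ell(X_{i_tj_t},Y_{i_tj_t})\right] \\
&\le \mathbb{E}_{\sigma}\left[\sup_{X\in\mathcal{W}_r[p]} \frac{1}{s}\sum_{t\in T_S^0} \sigma_t\,\ell(X_{i_tj_t},Y_{i_tj_t})\right] + \mathbb{E}_{\sigma}\left[\sup_{X\in\mathcal{W}_r[p]} \frac{1}{s}\sum_{t\in T_S^1} \sigma_t\,\ell(X_{i_tj_t},Y_{i_tj_t})\right].
\end{aligned}$$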



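Bounding the first term,

$$\mathbb{E}_{\sigma}\left[\sup_{X\in\mathcal{W}_r[p]} \frac{1}{s}\sum_{t\in T_S^0} \sigma_t\,\ell(X_{i_tj_t},Y_{i_tj_t})\right] \le \frac{b\,|T_S^0|}{s}.$$

In expectation over $S$,

$$\begin{aligned}
\mathbb{E}_S\left[\frac{b\,|T_S^0|}{s}\right] &= b\cdot\mathbb{E}_{(i,j)\sim p}\left[\mathbb{I}\left\{p^r(i) \text{ or } p^c(j) < \sqrt[3]{\frac{l^2 r\log(n)}{b^2 s n^2}}\right\}\right] \\
&\le b\cdot\sum_{i:\,p^r(i) < \sqrt[3]{\frac{l^2 r\log(n)}{b^2 s n^2}}}\ \sum_j p(i,j) \;+\; b\cdot\sum_{j:\,p^c(j) < \sqrt[3]{\frac{l^2 r\log(n)}{b^2 s n^2}}}\ \sum_i p(i,j) \\
&= b\cdot\sum_{i:\,p^r(i) < \sqrt[3]{\frac{l^2 r\log(n)}{b^2 s n^2}}} p^r(i) \;+\; b\cdot\sum_{j:\,p^c(j) < \sqrt[3]{\frac{l^2 r\log(n)}{b^2 s n^2}}} p^c(j) \\
&\le 2bn\cdot\sqrt[3]{\frac{l^2 r\log(n)}{b^2 s n^2}} = 2\,\sqrt[3]{\frac{l^2 b\, rn\log(n)}{s}}.
\end{aligned}$$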
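The counting step for the first term relies only on the fact that rows (or columns) with marginal below a threshold $c$ carry total mass at most $nc$ (or $mc$). A minimal sketch of that check, with a hypothetical helper `low_marginal_mass` and an arbitrary non-product test distribution:

```python
import numpy as np

def low_marginal_mass(p, c):
    """Probability, under (i,j) ~ p, that p^r(i) < c or p^c(j) < c."""
    pr, pc = p.sum(axis=1), p.sum(axis=0)
    low = (pr[:, None] < c) | (pc[None, :] < c)   # entries in a low-marginal row or column
    return p[low].sum()

rng = np.random.default_rng(1)
n, m = 10, 7
p = rng.dirichlet(np.ones(n * m)).reshape(n, m)   # arbitrary joint distribution on entries
c = 0.05                                          # hypothetical threshold
mass = low_marginal_mass(p, c)
```

The union bound gives `mass <= (n + m) * c`, which is at most $2nc$ when $m \le n$.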



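To bound the second term, we use the fact that $\|\mathrm{abs}(X)\|_{\mathrm{tr}} \le \|X\|_{\mathrm{tr}}$ for any matrix $X$, where $\mathrm{abs}(X)$ is the matrix defined via $\mathrm{abs}(X)_{ij} = |X_{ij}|$. We have

$$\begin{aligned}
&\mathbb{E}_{\sigma}\left[\sup_{\|X\|_{\mathrm{tr}(p^r,p^c)}\le\sqrt{r}} \frac{1}{s}\sum_{t\in T_S^1} \sigma_t\,\ell(X_{i_tj_t},Y_{i_tj_t})\right] \\
&\quad\le \mathbb{E}_{\sigma}\left[\sup_{\|X\|_{\mathrm{tr}(p^r,p^c)}\le\sqrt{r}} \frac{1}{s}\sum_{t\in T_S^1} \sigma_t\big(\ell(X_{i_tj_t},Y_{i_tj_t}) - \ell(0,Y_{i_tj_t})\big)\right] + \mathbb{E}_{\sigma}\left[\sup_{\|X\|_{\mathrm{tr}(p^r,p^c)}\le\sqrt{r}} \frac{1}{s}\sum_{t\in T_S^1} \sigma_t\,\ell(0,Y_{i_tj_t})\right] \\
&\quad= \mathbb{E}_{\sigma}\left[\sup_{\|X\|_{\mathrm{tr}(p^r,p^c)}\le\sqrt{r}} \frac{1}{s}\sum_{t\in T_S^1} \sigma_t\big(\ell(X_{i_tj_t},Y_{i_tj_t}) - \ell(0,Y_{i_tj_t})\big)\right] \\
&\quad\le l\cdot\mathbb{E}_{\sigma}\left[\sup_{\|X\|_{\mathrm{tr}(p^r,p^c)}\le\sqrt{r}} \frac{1}{s}\sum_{t\in T_S^1} \sigma_t\,|X_{i_tj_t}|\right] \\
&\quad\le l\cdot\mathbb{E}_{\sigma}\left[\sup_{\|X\|_{\mathrm{tr}(p^r,p^c)}\le\sqrt{r}} \frac{1}{s}\sum_{t\in T_S^1} \sigma_t\,X_{i_tj_t}\right] \\
&\quad= l\cdot\mathbb{E}_{\sigma}\left[\sup_{\|X\|_{\mathrm{tr}}\le\sqrt{r}} \frac{1}{s}\sum_{t\in T_S^1} \frac{\sigma_t}{\sqrt{p^r(i_t)p^c(j_t)}}\cdot X_{i_tj_t}\right] \\
&\quad= \frac{l\sqrt{r}}{s}\cdot\mathbb{E}_{\sigma}\left[\left\|\sum_{t\in T_S^1} \sigma_t\,\frac{e_{i_t,j_t}}{\sqrt{p^r(i_t)p^c(j_t)}}\right\|_{\mathrm{sp}}\right].
\end{aligned}$$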



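Defining $Q_t = \sigma_t\,\frac{e_{i_t,j_t}}{\sqrt{p^r(i_t)p^c(j_t)}}$ for $t \in T_S^1$, we can follow identical arguments as in the proof of the first bound of this theorem. Writing $c = \sqrt[3]{\frac{l^2 r\log(n)}{b^2 s n^2}}$ for the threshold used in the definition of $T_S^0$, we have

$$\|Q_t\|_{\mathrm{sp}} \le \max_{ij:\ p^r(i),\,p^c(j)\,\ge\, c} \frac{1}{\sqrt{p^r(i)p^c(j)}} \le \sqrt[3]{\frac{b^2 s n^2}{l^2 r\log(n)}} =: R$$

and

$$\begin{aligned}
\sigma^2 &= \max\left(\Big\|\sum_t \mathbb{E}\big[Q_t^T Q_t\big]\Big\|_{\mathrm{sp}},\ \Big\|\sum_t \mathbb{E}\big[Q_t Q_t^T\big]\Big\|_{\mathrm{sp}}\right) \\
&\le s\cdot\max\left\{\max_i \sum_{j:\ p^r(i),\,p^c(j)\,\ge\, c} \frac{p(i,j)}{p^r(i)p^c(j)},\ \max_j \sum_{i:\ p^r(i),\,p^c(j)\,\ge\, c} \frac{p(i,j)}{p^r(i)p^c(j)}\right\} \\
&\le s\cdot\sqrt[3]{\frac{b^2 s n^2}{l^2 r\log(n)}}\cdot\max\left\{\max_i \sum_j \frac{p(i,j)}{p^r(i)},\ \max_j \sum_i \frac{p(i,j)}{p^c(j)}\right\} = s\cdot\sqrt[3]{\frac{b^2 s n^2}{l^2 r\log(n)}}.
\end{aligned}$$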
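Then applying [TJ], we get

$$\begin{aligned}
\mathbb{E}_{S,\sigma}\left[\sup_{\|X\|_{\mathrm{tr}(p^r,p^c)}\le\sqrt{r}} \frac{1}{s}\sum_{t\in T_S^1} \sigma_t\,\ell(X_{i_tj_t},Y_{i_tj_t})\right] &\le \frac{l\sqrt{r}}{s}\cdot\mathbb{E}_{S,\sigma}\left[\Big\|\sum_{t\in T_S^1} Q_t\Big\|_{\mathrm{sp}}\right] \\
&\le O\left(\frac{l\sqrt{r}}{s}\Big(\sigma\sqrt{\log(n)} + R\log(n)\Big)\right) \\
&\le O\left(\frac{l\sqrt{r}}{s}\left(\sqrt{s\cdot\sqrt[3]{\frac{b^2 s n^2}{l^2 r\log(n)}}}\cdot\sqrt{\log(n)} + \sqrt[3]{\frac{b^2 s n^2}{l^2 r\log(n)}}\cdot\log(n)\right)\right).
\end{aligned}$$

If $s \ge rn\log(n)$, then this proves the bound. If not, then the result is trivial, since $L_p(X) \le b$ for any $X$.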
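The exponent bookkeeping behind this last step can be written out explicitly. The following worked computation is a sketch that assumes the values $R = \big(\tfrac{b^2 s n^2}{l^2 r \log n}\big)^{1/3}$ and $\sigma^2 \le sR$ from this proof, together with $s \ge rn\log(n)$ and $r \ge 1$; it is a plausibility check rather than part of the original argument.

```latex
\begin{align*}
\frac{l\sqrt{r}}{s}\,\sigma\sqrt{\log n}
  &\le \frac{l\sqrt{r}}{s}\,\sqrt{s\log n}\,
       \Big(\frac{b^2 s n^2}{l^2 r\log n}\Big)^{1/6}
   = l^{2/3}\, b^{1/3}\, r^{1/3}\, n^{1/3}\, (\log n)^{1/3}\, s^{-1/3}
   = \Big(\frac{l^2 b\, rn\log n}{s}\Big)^{1/3}, \\[4pt]
\frac{l\sqrt{r}}{s}\,R\log n
  &= l^{1/3}\, b^{2/3}\, r^{1/6}\, \Big(\frac{n\log n}{s}\Big)^{2/3}
   \le l^{1/3}\, b^{2/3}\, r^{-1/6}\, \Big(\frac{n\log n}{s}\Big)^{1/3}
   \le \Big(\frac{l\, b^2\, rn\log n}{s}\Big)^{1/3}.
\end{align*}
```

The middle inequality in the second line uses $(n\log n/s)^{1/3} \le r^{-1/3}$, which follows from $s \ge rn\log n$. Since $l^{2/3}b^{1/3} \le l + b$ and $l^{1/3}b^{2/3} \le l + b$, both terms are at most $(l+b)\,\sqrt[3]{rn\log(n)/s}$.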
A. 3 Proof of Theorem H] 

Throughout this section, assume $s \ge 24n\log(n)$. (If this is not the case, then we only need to prove excess error $\le O(l\sqrt{r})$, which is trivial given the class $\mathcal{W}_r[\check{p}]$.) We also assume $s \le O(nm\log(nm))$. (If this is not the case, then with high probability we observe all entries of the matrix and obtain optimal recovery.) The lemmas cited in this proof are proved below.
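Here $\tilde{p}$ denotes the smoothed true marginals, $\hat{p}$ the empirical marginals, and $\check{p}$ the smoothed empirical marginals. Define

$$X^* = \arg\min_{X\in\mathcal{W}_r[\tilde{p}]} L_p(X).$$

For any sample $S$, define

$$c(S) = \max\left\{0,\ 1 - \frac{\sqrt{r}}{\|X^*\|_{\mathrm{tr}(\check{p}^r,\check{p}^c)}}\right\}.$$

Then, for a fixed $S$,

$$\big\|(1-c(S))X^*\big\|_{\mathrm{tr}(\check{p}^r,\check{p}^c)} = (1-c(S))\,\|X^*\|_{\mathrm{tr}(\check{p}^r,\check{p}^c)} \le \sqrt{r} \ \Rightarrow\ (1-c(S))X^* \in \mathcal{W}_r[\check{p}].$$

Applying Lemma 1 and Theorem 3,

$$\begin{aligned}
\mathbb{E}\left[L_p(\hat{X}_S) - \hat{L}_S(\hat{X}_S)\right] &\le \mathbb{E}\left[\sup_{X\in\mathcal{W}_r[\check{p}]} \big(L_p(X) - \hat{L}_S(X)\big)\right] \\
&\le \mathbb{E}\left[\sup_{X\in 2\cdot\mathcal{W}_r[\tilde{p}]} \big(L_p(X) - \hat{L}_S(X)\big)\right] + \frac{8\sqrt{l^2 rnm}}{n^2} \\
&\le O\left(\sqrt{\frac{l^2 rn\log(n)}{s}}\right) + \frac{8\sqrt{l^2 rnm}}{n^2} \le O\left(\sqrt{\frac{l^2 rn\log(n)}{s}}\right).
\end{aligned}$$

And, similarly,

$$\begin{aligned}
\mathbb{E}\left[\hat{L}_S\big((1-c(S))X^*\big) - L_p\big((1-c(S))X^*\big)\right] &\le \mathbb{E}\left[\sup_{X\in\mathcal{W}_r[\check{p}]} \big(\hat{L}_S(X) - L_p(X)\big)\right] \\
&\le \mathbb{E}\left[\sup_{X\in 2\cdot\mathcal{W}_r[\tilde{p}]} \big(\hat{L}_S(X) - L_p(X)\big)\right] + \frac{8\sqrt{l^2 rnm}}{n^2} \\
&\le O\left(\sqrt{\frac{l^2 rn\log(n)}{s}}\right) + \frac{8\sqrt{l^2 rnm}}{n^2} \le O\left(\sqrt{\frac{l^2 rn\log(n)}{s}}\right).
\end{aligned}$$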
By definition, since $(1-c(S))X^* \in \mathcal{W}_r[\check{p}]$,

$$\mathbb{E}\left[\hat{L}_S(\hat{X}_S) - \hat{L}_S\big((1-c(S))X^*\big)\right] \le 0.$$

Finally, by Lemma 3,

$$\mathbb{E}\left[L_p\big((1-c(S))X^*\big) - L_p(X^*)\right] \le \sqrt{\frac{2 l^2 rn}{s}}.$$

Combining all of the above, we get

$$\mathbb{E}\left[L_p(\hat{X}_S) - L_p(X^*)\right] \le O\left(\sqrt{\frac{l^2 rn\log(n)}{s}}\right).$$
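A.3.1 Lemmas for Theorem 6

Lemma 1.

$$\mathbb{E}\left[\sup_{X\in\mathcal{W}_r[\check{p}]} \big(L_p(X) - \hat{L}_S(X)\big)\right] \le \mathbb{E}\left[\sup_{X\in 2\cdot\mathcal{W}_r[\tilde{p}]} \big(L_p(X) - \hat{L}_S(X)\big)\right] + \frac{8\sqrt{l^2 rnm}}{n^2},$$

$$\mathbb{E}\left[\sup_{X\in\mathcal{W}_r[\check{p}]} \big(\hat{L}_S(X) - L_p(X)\big)\right] \le \mathbb{E}\left[\sup_{X\in 2\cdot\mathcal{W}_r[\tilde{p}]} \big(\hat{L}_S(X) - L_p(X)\big)\right] + \frac{8\sqrt{l^2 rnm}}{n^2}.$$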
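Proof. By Lemma 2, with probability at least $1 - 2n^{-2}$, for all $i$ and all $j$,

$$\check{p}^r(i) \ge \tfrac{1}{2}\tilde{p}^r(i),\qquad \check{p}^c(j) \ge \tfrac{1}{2}\tilde{p}^c(j).$$

Let $A$ be the event that these inequalities hold. If $A$ occurs, then for any $X \in \mathcal{W}_r[\check{p}]$,

$$\begin{aligned}
\|X\|_{\mathrm{tr}(\tilde{p}^r,\tilde{p}^c)} &= \left\|\mathrm{diag}(\tilde{p}^r)^{1/2}\,X\,\mathrm{diag}(\tilde{p}^c)^{1/2}\right\|_{\mathrm{tr}} \\
&= \left\|\mathrm{diag}\Big(\tfrac{\tilde{p}^r}{\check{p}^r}\Big)^{1/2}\mathrm{diag}(\check{p}^r)^{1/2}\,X\,\mathrm{diag}(\check{p}^c)^{1/2}\,\mathrm{diag}\Big(\tfrac{\tilde{p}^c}{\check{p}^c}\Big)^{1/2}\right\|_{\mathrm{tr}} \\
&\le \sqrt{\max_i \frac{\tilde{p}^r(i)}{\check{p}^r(i)}}\cdot\left\|\mathrm{diag}(\check{p}^r)^{1/2}\,X\,\mathrm{diag}(\check{p}^c)^{1/2}\right\|_{\mathrm{tr}}\cdot\sqrt{\max_j \frac{\tilde{p}^c(j)}{\check{p}^c(j)}} \\
&\le 2\,\|X\|_{\mathrm{tr}(\check{p}^r,\check{p}^c)} \le 2\sqrt{r}.
\end{aligned}$$

In this case, $\mathcal{W}_r[\check{p}] \subset 2\cdot\mathcal{W}_r[\tilde{p}]$, and therefore,

$$\sup_{X\in\mathcal{W}_r[\check{p}]} \Big[\big(L_p(X) - L_p(0_{n\times m})\big) - \big(\hat{L}_S(X) - \hat{L}_S(0_{n\times m})\big)\Big] \le \sup_{X\in 2\cdot\mathcal{W}_r[\tilde{p}]} \Big[\big(L_p(X) - L_p(0_{n\times m})\big) - \big(\hat{L}_S(X) - \hat{L}_S(0_{n\times m})\big)\Big].$$

Next we consider the case that $A$ does not occur. For any $X \in \mathcal{W}_r[\check{p}]$,

$$\begin{aligned}
|X|_\infty \le \|X\|_F &= 2\sqrt{nm}\,\left\|\mathrm{diag}\Big(\tfrac{1}{2n}\mathbf{1}_n\Big)^{1/2} X\,\mathrm{diag}\Big(\tfrac{1}{2m}\mathbf{1}_m\Big)^{1/2}\right\|_F \\
&\le 2\sqrt{nm}\,\left\|\mathrm{diag}\Big(\tfrac{1}{2n}\mathbf{1}_n\Big)^{1/2} X\,\mathrm{diag}\Big(\tfrac{1}{2m}\mathbf{1}_m\Big)^{1/2}\right\|_{\mathrm{tr}} \\
&\le 2\sqrt{nm}\,\|X\|_{\mathrm{tr}(\check{p}^r,\check{p}^c)} \le 2\sqrt{rnm}.
\end{aligned}$$

Therefore,

$$\begin{aligned}
&\sup_{X\in\mathcal{W}_r[\check{p}]} \Big[\big(L_p(X) - L_p(0_{n\times m})\big) - \big(\hat{L}_S(X) - \hat{L}_S(0_{n\times m})\big)\Big] \\
&\quad\le \sup_{X\in\mathcal{W}_r[\check{p}]} \left|\sum_{ij} p(i,j)\big(\ell(X_{ij},Y_{ij}) - \ell(0,Y_{ij})\big) - \frac{1}{s}\sum_{t}\big(\ell(X_{i_tj_t},Y_{i_tj_t}) - \ell(0,Y_{i_tj_t})\big)\right| \\
&\quad\le l\cdot\sup_{X\in\mathcal{W}_r[\check{p}]} \left[\sum_{ij} p(i,j)\,|X_{ij}| + \frac{1}{s}\sum_t |X_{i_tj_t}|\right] \\
&\quad\le l\cdot\sup_{X\in\mathcal{W}_r[\check{p}]} \left[\sum_{ij} p(i,j)\cdot 2\sqrt{rnm} + \frac{1}{s}\sum_t 2\sqrt{rnm}\right] = 4\sqrt{l^2 rnm}.
\end{aligned}$$

And so

$$\begin{aligned}
&\mathbb{E}_S\left[\sup_{X\in\mathcal{W}_r[\check{p}]} \Big[\big(L_p(X) - L_p(0_{n\times m})\big) - \big(\hat{L}_S(X) - \hat{L}_S(0_{n\times m})\big)\Big]\right] \\
&\quad= \mathbb{E}_S\left[\sup_{X\in\mathcal{W}_r[\check{p}]} \Big[\big(L_p(X) - L_p(0_{n\times m})\big) - \big(\hat{L}_S(X) - \hat{L}_S(0_{n\times m})\big)\Big]\cdot\mathbb{I}\{A\}\right] \\
&\qquad+ \mathbb{E}_S\left[\sup_{X\in\mathcal{W}_r[\check{p}]} \Big[\big(L_p(X) - L_p(0_{n\times m})\big) - \big(\hat{L}_S(X) - \hat{L}_S(0_{n\times m})\big)\Big]\cdot\mathbb{I}\{A^c\}\right] \\
&\quad\le \mathbb{E}_S\left[\sup_{X\in 2\cdot\mathcal{W}_r[\tilde{p}]} \Big[\big(L_p(X) - L_p(0_{n\times m})\big) - \big(\hat{L}_S(X) - \hat{L}_S(0_{n\times m})\big)\Big]\cdot\mathbb{I}\{A\}\right] + P(A^c)\cdot 4\sqrt{l^2 rnm} \\
&\quad\le \mathbb{E}_S\left[\sup_{X\in 2\cdot\mathcal{W}_r[\tilde{p}]} \Big[\big(L_p(X) - L_p(0_{n\times m})\big) - \big(\hat{L}_S(X) - \hat{L}_S(0_{n\times m})\big)\Big]\cdot\mathbb{I}\{A\}\right] + \frac{8\sqrt{l^2 rnm}}{n^2} \\
&\quad\le \mathbb{E}_S\left[\sup_{X\in 2\cdot\mathcal{W}_r[\tilde{p}]} \Big[\big(L_p(X) - L_p(0_{n\times m})\big) - \big(\hat{L}_S(X) - \hat{L}_S(0_{n\times m})\big)\Big]\right] + \frac{8\sqrt{l^2 rnm}}{n^2},
\end{aligned}$$

where the last step is true because, since $0_{n\times m} \in 2\cdot\mathcal{W}_r[\tilde{p}]$, for any $S$,

$$\sup_{X\in 2\cdot\mathcal{W}_r[\tilde{p}]} \Big[\big(L_p(X) - L_p(0_{n\times m})\big) - \big(\hat{L}_S(X) - \hat{L}_S(0_{n\times m})\big)\Big] \ge 0.$$

And, $\mathbb{E}_S\big[L_p(0_{n\times m}) - \hat{L}_S(0_{n\times m})\big] = 0$, so therefore,

$$\mathbb{E}_S\left[\sup_{X\in\mathcal{W}_r[\check{p}]} \big(L_p(X) - \hat{L}_S(X)\big)\right] \le \mathbb{E}_S\left[\sup_{X\in 2\cdot\mathcal{W}_r[\tilde{p}]} \big(L_p(X) - \hat{L}_S(X)\big)\right] + \frac{8\sqrt{l^2 rnm}}{n^2}.$$

The second claim can be proved with identical arguments.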
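Lemma 2. With probability at least $1 - 2n^{-2}$, for all $i$ and all $j$,

$$\check{p}^r(i) \ge \tfrac{1}{2}\tilde{p}^r(i),\qquad \check{p}^c(j) \ge \tfrac{1}{2}\tilde{p}^c(j).$$

Proof. Take any row $i$. Suppose that $p^r(i) < \frac{1}{n}$. Then $\tilde{p}^r(i) < \frac{1}{n}$, while $\check{p}^r(i) = \frac{1}{2}\big(\hat{p}^r(i) + \frac{1}{n}\big) \ge \frac{1}{2n}$. Therefore, in this case, $\check{p}^r(i) \ge \frac{1}{2}\tilde{p}^r(i)$ with probability 1.

Next, suppose that $p^r(i) \ge \frac{1}{n}$. Then, by the Chernoff inequality,

$$P\Big(\hat{p}^r(i) < \tfrac{1}{2}p^r(i)\Big) = P\Big(\mathrm{Bin}\big(s, p^r(i)\big) < s\,p^r(i)\big(1 - \tfrac{1}{2}\big)\Big) \le e^{-\frac{s\,p^r(i)}{8}} \le e^{-\frac{s}{8n}} \le e^{-3\log(n)} = n^{-3}.$$

Therefore, with probability at least $1 - n^{-3}$, $\hat{p}^r(i) \ge \frac{1}{2}p^r(i)$, and so

$$\check{p}^r(i) = \tfrac{1}{2}\Big(\hat{p}^r(i) + \tfrac{1}{n}\Big) \ge \tfrac{1}{2}\Big(\tfrac{1}{2}p^r(i) + \tfrac{1}{n}\Big) \ge \tfrac{1}{2}\tilde{p}^r(i).$$

Therefore, for any row $i$, with probability at least $1 - n^{-3}$, $\check{p}^r(i) \ge \frac{1}{2}\tilde{p}^r(i)$. The same reasoning applies to every column $j$. Therefore, with probability at least $1 - 2n^{-2}$, the statement holds for all $i$ and all $j$.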
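For concreteness, the smoothed empirical marginals and the arithmetic behind the Chernoff step can be sketched as follows. The function name `smoothed_empirical_marginals` and the toy sample are hypothetical; the smoothing formula $\frac{1}{2}(\text{empirical frequency} + \text{uniform})$ is the one used in the proof of Lemma 2.

```python
import numpy as np

def smoothed_empirical_marginals(rows, cols, n, m):
    """Row and column marginals 0.5 * (empirical frequency + uniform)."""
    s = len(rows)
    pr_hat = np.bincount(rows, minlength=n) / s   # empirical row frequencies
    pc_hat = np.bincount(cols, minlength=m) / s   # empirical column frequencies
    return 0.5 * (pr_hat + 1.0 / n), 0.5 * (pc_hat + 1.0 / m)

n, m = 5, 4
rows = np.array([0, 0, 1, 3, 3, 3])               # toy sample of s = 6 observed entries
cols = np.array([1, 2, 2, 0, 1, 3])
pr_check, pc_check = smoothed_empirical_marginals(rows, cols, n, m)

# With s = 24 n log(n), the exponent s / (8 n) equals 3 log(n), so the tail is n^{-3}.
s_min = 24 * n * np.log(n)
tail = np.exp(-s_min / (8 * n))
```

Every smoothed marginal is at least $\frac{1}{2n}$ (or $\frac{1}{2m}$) regardless of the sample, which is what makes the low-probability case of the lemma deterministic.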

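Lemma 3. Fix $X^*$ with $\|X^*\|_{\mathrm{tr}(\tilde{p}^r,\tilde{p}^c)} = \sqrt{r^*} \le \sqrt{r}$, and define

$$c(S) = \max\left\{0,\ 1 - \frac{\sqrt{r}}{\|X^*\|_{\mathrm{tr}(\check{p}^r,\check{p}^c)}}\right\}.$$

Then

$$\mathbb{E}\left[L_p\big((1-c(S))X^*\big) - L_p(X^*)\right] \le \sqrt{\frac{2 l^2 rn}{s}}.$$

Proof.

$$\begin{aligned}
L_p\big((1-c(S))X^*\big) - L_p(X^*) &= \sum_{ij} p(i,j)\Big(\ell\big((1-c(S))X^*_{ij},Y_{ij}\big) - \ell\big(X^*_{ij},Y_{ij}\big)\Big) \\
&\le l\cdot\sum_{ij} p(i,j)\,\big|(1-c(S))X^*_{ij} - X^*_{ij}\big| = l\cdot c(S)\cdot\sum_{ij} p(i,j)\,|X^*_{ij}| \\
&= l\cdot c(S)\cdot\sum_{ij} \frac{p(i,j)}{\sqrt{\tilde{p}^r(i)\tilde{p}^c(j)}}\cdot\sqrt{\tilde{p}^r(i)\tilde{p}^c(j)}\cdot|X^*_{ij}|.
\end{aligned}$$

Defining $M = \left(\frac{p(i,j)}{\sqrt{\tilde{p}^r(i)\tilde{p}^c(j)}}\right)_{ij}$,

$$\begin{aligned}
&= l\cdot c(S)\cdot\Big\langle M,\ \mathrm{diag}(\tilde{p}^r)^{1/2}\,X^*\,\mathrm{diag}(\tilde{p}^c)^{1/2}\Big\rangle \\
&\le l\cdot c(S)\cdot\|M\|_{\mathrm{sp}}\cdot\left\|\mathrm{diag}(\tilde{p}^r)^{1/2}\,X^*\,\mathrm{diag}(\tilde{p}^c)^{1/2}\right\|_{\mathrm{tr}} \\
&\le l\sqrt{r}\cdot c(S)\cdot\|M\|_{\mathrm{sp}}.
\end{aligned}$$

Now we show that $\|M\|_{\mathrm{sp}} \le 2$. Take any unit vectors $u \in \mathbb{R}^n$, $v \in \mathbb{R}^m$. Then

$$u^T M v = \sum_{ij} u_i v_j\,\frac{p(i,j)}{\sqrt{\tilde{p}^r(i)\tilde{p}^c(j)}} \le \frac{1}{2}\sum_{ij} u_i^2\,\frac{p(i,j)}{\tilde{p}^r(i)} + \frac{1}{2}\sum_{ij} v_j^2\,\frac{p(i,j)}{\tilde{p}^c(j)} \le \frac{1}{2}\sum_i 2u_i^2 + \frac{1}{2}\sum_j 2v_j^2 = 2.$$

So, by Lemma 4,

$$\mathbb{E}\left[L_p\big((1-c(S))X^*\big) - L_p(X^*)\right] \le 2l\sqrt{r}\cdot\mathbb{E}[c(S)] \le 2l\sqrt{r}\cdot\sqrt{\frac{n+m}{4s}}.$$

□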
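The claim $\|M\|_{\mathrm{sp}} \le 2$ is easy to probe numerically. The sketch below assumes the smoothed marginals are $\tilde{p}^r(i) = \frac{1}{2}(p^r(i) + \frac{1}{n})$ and $\tilde{p}^c(j) = \frac{1}{2}(p^c(j) + \frac{1}{m})$, and uses an arbitrary random joint distribution as a hypothetical test case.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 9, 6
p = rng.dirichlet(np.ones(n * m)).reshape(n, m)   # arbitrary joint distribution on entries
pr_tilde = 0.5 * (p.sum(axis=1) + 1.0 / n)        # smoothed row marginals
pc_tilde = 0.5 * (p.sum(axis=0) + 1.0 / m)        # smoothed column marginals
M = p / np.sqrt(np.outer(pr_tilde, pc_tilde))     # M_ij = p(i,j) / sqrt(p~^r(i) p~^c(j))
spec = np.linalg.norm(M, 2)                       # largest singular value of M
```

Because each smoothed marginal is at least half the true marginal, the bound holds for every distribution, not only for this draw.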
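Lemma 4. For any $p$, for any fixed $X$ with $\|X\|_{\mathrm{tr}(\tilde{p}^r,\tilde{p}^c)} = 1$,

$$\mathbb{E}\left[\max\big\{0,\ \|X\|_{\mathrm{tr}(\check{p}^r,\check{p}^c)} - 1\big\}\right] \le \sqrt{\frac{n+m}{4s}}.$$

Proof. By properties of the trace-norm [5], we can write $\mathrm{diag}(\tilde{p}^r)^{1/2}\,X\,\mathrm{diag}(\tilde{p}^c)^{1/2} = AB^T$, where $\|A\|_F^2 = \|B\|_F^2 = \|X\|_{\mathrm{tr}(\tilde{p}^r,\tilde{p}^c)} = 1$. Define

$$D_1 = \mathrm{diag}(\check{p}^r)\,\mathrm{diag}(\tilde{p}^r)^{-1},\qquad D_2 = \mathrm{diag}(\check{p}^c)\,\mathrm{diag}(\tilde{p}^c)^{-1}.$$

Then, by properties of the trace-norm [5],

$$\begin{aligned}
\|X\|_{\mathrm{tr}(\check{p}^r,\check{p}^c)} &= \left\|\mathrm{diag}(\check{p}^r)^{1/2}\,X\,\mathrm{diag}(\check{p}^c)^{1/2}\right\|_{\mathrm{tr}} = \left\|D_1^{1/2}AB^TD_2^{1/2}\right\|_{\mathrm{tr}} \\
&\le \frac{1}{2}\left\|D_1^{1/2}A\right\|_F^2 + \frac{1}{2}\left\|D_2^{1/2}B\right\|_F^2 \\
&= \frac{1}{2}\sum_i \frac{\check{p}^r(i)}{\tilde{p}^r(i)}\,\|A_{(i)}\|_2^2 + \frac{1}{2}\sum_j \frac{\check{p}^c(j)}{\tilde{p}^c(j)}\,\|B_{(j)}\|_2^2 \\
&= \frac{1}{4}\sum_i \frac{N_i^r + \frac{s}{n}}{s\,\tilde{p}^r(i)}\,\|A_{(i)}\|_2^2 + \frac{1}{4}\sum_j \frac{N_j^c + \frac{s}{m}}{s\,\tilde{p}^c(j)}\,\|B_{(j)}\|_2^2,
\end{aligned}$$

where $N_i^r$ is the number of samples in row $i$, and $N_j^c$ is the number of samples in column $j$. Clearly,

$$\begin{aligned}
&\mathbb{E}\left[\frac{1}{4}\sum_i \frac{N_i^r + \frac{s}{n}}{s\,\tilde{p}^r(i)}\,\|A_{(i)}\|_2^2 + \frac{1}{4}\sum_j \frac{N_j^c + \frac{s}{m}}{s\,\tilde{p}^c(j)}\,\|B_{(j)}\|_2^2\right] \\
&\quad= \frac{1}{4}\sum_i \frac{s\,p^r(i) + \frac{s}{n}}{s\,\tilde{p}^r(i)}\,\|A_{(i)}\|_2^2 + \frac{1}{4}\sum_j \frac{s\,p^c(j) + \frac{s}{m}}{s\,\tilde{p}^c(j)}\,\|B_{(j)}\|_2^2 \\
&\quad= \frac{1}{4}\sum_i \frac{2s\,\tilde{p}^r(i)}{s\,\tilde{p}^r(i)}\,\|A_{(i)}\|_2^2 + \frac{1}{4}\sum_j \frac{2s\,\tilde{p}^c(j)}{s\,\tilde{p}^c(j)}\,\|B_{(j)}\|_2^2 = \frac{1}{2}\|A\|_F^2 + \frac{1}{2}\|B\|_F^2 = 1.
\end{aligned}$$

And, we can compute

$$\mathrm{Var}(N_i^r) \le s\,p^r(i),\quad \mathrm{Cov}(N_i^r, N_{i'}^r) \le 0,\quad \mathrm{Var}(N_j^c) \le s\,p^c(j),\quad \mathrm{Cov}(N_j^c, N_{j'}^c) \le 0.$$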
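Therefore,

$$\begin{aligned}
&\mathrm{Var}\left(\frac{1}{4}\sum_i \frac{N_i^r + \frac{s}{n}}{s\,\tilde{p}^r(i)}\,\|A_{(i)}\|_2^2 + \frac{1}{4}\sum_j \frac{N_j^c + \frac{s}{m}}{s\,\tilde{p}^c(j)}\,\|B_{(j)}\|_2^2\right) \\
&\quad\le \sum_i \frac{\mathrm{Var}(N_i^r)}{16 s^2\,\tilde{p}^r(i)^2}\,\|A_{(i)}\|_2^4 + 2\sum_{i<i'} \frac{\mathrm{Cov}(N_i^r,N_{i'}^r)}{16 s^2\,\tilde{p}^r(i)\tilde{p}^r(i')}\,\|A_{(i)}\|_2^2\|A_{(i')}\|_2^2 \\
&\qquad+ \sum_j \frac{\mathrm{Var}(N_j^c)}{16 s^2\,\tilde{p}^c(j)^2}\,\|B_{(j)}\|_2^4 + 2\sum_{j<j'} \frac{\mathrm{Cov}(N_j^c,N_{j'}^c)}{16 s^2\,\tilde{p}^c(j)\tilde{p}^c(j')}\,\|B_{(j)}\|_2^2\|B_{(j')}\|_2^2 \\
&\quad\le \sum_i \frac{\mathrm{Var}(N_i^r)}{16 s^2\,\tilde{p}^r(i)^2}\,\|A_{(i)}\|_2^4 + \sum_j \frac{\mathrm{Var}(N_j^c)}{16 s^2\,\tilde{p}^c(j)^2}\,\|B_{(j)}\|_2^4 \\
&\quad\le \sum_i \frac{s\,p^r(i)}{16 s^2\,\tilde{p}^r(i)^2}\,\|A_{(i)}\|_2^4 + \sum_j \frac{s\,p^c(j)}{16 s^2\,\tilde{p}^c(j)^2}\,\|B_{(j)}\|_2^4.
\end{aligned}$$

Since $\tilde{p}^r(i) \ge \frac{1}{2}p^r(i)$ and $\tilde{p}^r(i) \ge \frac{1}{2n}$, and similarly for the columns, we continue:

$$\le \frac{n}{4s}\sum_i \|A_{(i)}\|_2^4 + \frac{m}{4s}\sum_j \|B_{(j)}\|_2^4 \le \frac{n}{4s}\Big(\sum_i \|A_{(i)}\|_2^2\Big)^2 + \frac{m}{4s}\Big(\sum_j \|B_{(j)}\|_2^2\Big)^2 = \frac{n}{4s}\|A\|_F^4 + \frac{m}{4s}\|B\|_F^4 \le \frac{n+m}{4s}.$$

So, we have

$$\begin{aligned}
\mathbb{E}\left[\max\big\{0,\ \|X\|_{\mathrm{tr}(\check{p}^r,\check{p}^c)} - 1\big\}\right] &\le \mathbb{E}\left[\max\left\{0,\ \frac{1}{4}\sum_i \frac{N_i^r + \frac{s}{n}}{s\,\tilde{p}^r(i)}\,\|A_{(i)}\|_2^2 + \frac{1}{4}\sum_j \frac{N_j^c + \frac{s}{m}}{s\,\tilde{p}^c(j)}\,\|B_{(j)}\|_2^2 - 1\right\}\right] \\
&\le \sqrt{\mathrm{Var}\left(\frac{1}{4}\sum_i \frac{N_i^r + \frac{s}{n}}{s\,\tilde{p}^r(i)}\,\|A_{(i)}\|_2^2 + \frac{1}{4}\sum_j \frac{N_j^c + \frac{s}{m}}{s\,\tilde{p}^c(j)}\,\|B_{(j)}\|_2^2\right)} \\
&\le \sqrt{\frac{n+m}{4s}}.
\end{aligned}$$

□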
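The factorization inequality that opens this proof, $\|D_1^{1/2}AB^TD_2^{1/2}\|_{\mathrm{tr}} \le \frac{1}{2}\|D_1^{1/2}A\|_F^2 + \frac{1}{2}\|D_2^{1/2}B\|_F^2$, can be illustrated with a short sketch. The helper `weighted_trace_norm` and the random diagonal weights are hypothetical stand-ins for the ratios of smoothed empirical to smoothed true marginals.

```python
import numpy as np

def weighted_trace_norm(X, wr, wc):
    """Trace norm of diag(wr)^{1/2} X diag(wc)^{1/2}, computed from singular values."""
    W = np.sqrt(wr)[:, None] * X * np.sqrt(wc)[None, :]
    return np.linalg.svd(W, compute_uv=False).sum()

rng = np.random.default_rng(3)
n, m, k = 7, 5, 3
A = rng.standard_normal((n, k))
B = rng.standard_normal((m, k))
d1 = rng.uniform(0.2, 3.0, size=n)     # diagonal of D_1 (hypothetical values)
d2 = rng.uniform(0.2, 3.0, size=m)     # diagonal of D_2 (hypothetical values)
lhs = weighted_trace_norm(A @ B.T, d1, d2)
rhs = 0.5 * (d1[:, None] * A**2).sum() + 0.5 * (d2[:, None] * B**2).sum()
```

The inequality is the standard bound $\|UV^T\|_{\mathrm{tr}} \le \|U\|_F\|V\|_F \le \frac{1}{2}(\|U\|_F^2 + \|V\|_F^2)$ applied to $U = D_1^{1/2}A$ and $V = D_2^{1/2}B$.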



B Proofs for the transductive setting 
B . 1 Proof of Theorem [U 

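Let $\bar{S} \subset [n]\times[m]$ be a subset of size $2s$. Let $\check{p}$ denote the smoothed empirical marginals of $\bar{S}$.

Now choose any $S \subset \bar{S}$, a training set of size $s$. Without loss of generality, write $\bar{S} = \{(i_1,j_1),\dots,(i_{2s},j_{2s})\}$ and $S = \{(i_1,j_1),\dots,(i_s,j_s)\}$.

First, we bound transductive Rademacher complexity. By Lemma 12 in [5], for any sample $S$,

$$\begin{aligned}
\mathbb{E}_{\sigma\sim\{\pm1\}^s}\left[\sup_{X\in\mathcal{W}_r[\check{p}]} \frac{1}{s}\sum_{t=1}^s \sigma_t X_{i_tj_t}\right] &= \mathbb{E}_{\sigma\sim\{\pm1\}^s}\left[\sup_{X\in\mathcal{W}_r[\check{p}]} \frac{1}{s}\sum_{ij} X_{ij}\Big(\sum_{t:(i_t,j_t)=(i,j)} \sigma_t\Big)\right] \\
&\le \mathbb{E}_{\sigma\sim\{\pm1\}^{n\times m}}\left[\sup_{X\in\mathcal{W}_r[\check{p}]} \frac{1}{s}\sum_{ij} X_{ij}\sigma_{ij}\cdot\#\{t : (i_t,j_t)=(i,j),\ 1\le t\le s\}\right] \\
&\le \mathbb{E}_{\sigma\sim\{\pm1\}^{n\times m}}\left[\sup_{X\in\mathcal{W}_r[\check{p}]} \frac{1}{s}\sum_{ij} X_{ij}\sigma_{ij}\cdot\mathbb{I}\{(i,j)\in\bar{S}\}\right].
\end{aligned}$$

Now define matrix $\Sigma$ via

$$\Sigma_{ij} = \frac{\mathbb{I}\{(i,j)\in\bar{S}\}}{s\sqrt{\check{p}^r(i)\check{p}^c(j)}}.$$

We have

$$\begin{aligned}
\mathbb{E}_{\sigma\sim\{\pm1\}^{n\times m}}\left[\sup_{X\in\mathcal{W}_r[\check{p}]} \frac{1}{s}\sum_{ij} X_{ij}\sigma_{ij}\cdot\mathbb{I}\{(i,j)\in\bar{S}\}\right] &= \mathbb{E}_{\sigma\sim\{\pm1\}^{n\times m}}\left[\sup_{X\in\mathcal{W}_r[\check{p}]} \sum_{ij} \Big(\sqrt{\check{p}^r(i)}\,X_{ij}\sqrt{\check{p}^c(j)}\Big)\sigma_{ij}\Sigma_{ij}\right] \\
&= \mathbb{E}_{\sigma\sim\{\pm1\}^{n\times m}}\left[\sup_{X:\,\|X\|_{\mathrm{tr}}\le\sqrt{r}} \sum_{ij} X_{ij}\sigma_{ij}\Sigma_{ij}\right] \\
&= \sqrt{r}\cdot\mathbb{E}_{\sigma\sim\{\pm1\}^{n\times m}}\Big[\|\sigma\circ\Sigma\|_{\mathrm{sp}}\Big],
\end{aligned}$$

where $\sigma\circ\Sigma$ is the element-wise product of $\Sigma$ with the random sign matrix $\sigma = (\sigma_{ij})$. We then have

$$\mathbb{E}_{\sigma\sim\{\pm1\}^{n\times m}}\Big[\|\sigma\circ\Sigma\|_{\mathrm{sp}}\Big] \le O\Big(\log^{1/4}(n+m)\Big)\cdot\max\left\{\max_i \|\Sigma_{(i)}\|_2,\ \max_j \|\Sigma^{(j)}\|_2\right\}.$$

We now bound $\|\Sigma_{(i)}\|_2$ and $\|\Sigma^{(j)}\|_2$. Fix any $i$. Then

$$\begin{aligned}
\|\Sigma_{(i)}\|_2^2 = \sum_j \Sigma_{ij}^2 &= \sum_j \frac{\mathbb{I}\{(i,j)\in\bar{S}\}}{\big(s\,\check{p}^r(i)\big)\cdot\big(s\,\check{p}^c(j)\big)} \le \sum_j \frac{\mathbb{I}\{(i,j)\in\bar{S}\}}{\big(s\,\check{p}^r(i)\big)\cdot\big(s\cdot\frac{1}{2m}\big)} \\
&= \frac{\#\{t : i_t = i,\ 1\le t\le 2s\}}{\Big(s\cdot\frac{1}{2}\Big(\frac{\#\{t : i_t = i,\ 1\le t\le 2s\}}{2s} + \frac{1}{n}\Big)\Big)\cdot\big(s\cdot\frac{1}{2m}\big)} \le \frac{\#\{t : i_t = i,\ 1\le t\le 2s\}}{\Big(\frac{1}{4}\,\#\{t : i_t = i,\ 1\le t\le 2s\}\Big)\cdot\big(s\cdot\frac{1}{2m}\big)} = \frac{8m}{s}.
\end{aligned}$$

Similarly, for all $j$, $\|\Sigma^{(j)}\|_2^2 \le \frac{8n}{s}$. Therefore,

$$\mathbb{E}_{\sigma\sim\{\pm1\}^{n\times m}}\Big[\|\sigma\circ\Sigma\|_{\mathrm{sp}}\Big] \le O\Big(\log^{1/4}(n)\Big)\cdot\max\left\{\max_i \|\Sigma_{(i)}\|_2,\ \max_j \|\Sigma^{(j)}\|_2\right\} \le O\left(\sqrt{\frac{n\log^{1/2}(n)}{s}}\right).$$

Applying Theorem 5 of [12] (using integration to obtain a bound in expectation from a bound in probability),

$$\mathbb{E}_S\left[\hat{L}_T(\hat{X}_S) - \inf_{X\in\mathcal{W}_r[\check{p}]} \hat{L}_T(X)\right] \le O\left(\sqrt{\frac{l^2 rn\log^{1/2}(n) + b^2}{s}}\right).$$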
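The row- and column-norm bounds used in this proof can be checked on a random instance. The sketch below assumes the matrix in question has entries $\mathbb{I}\{(i,j)\in\bar{S}\}/(s\sqrt{\check{p}^r(i)\check{p}^c(j)})$, where $\bar{S}$ is the full set of $2s$ distinct entries and $\check{p}$ its smoothed empirical marginals; the variable names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, s = 12, 8, 15
# Draw a set of 2s distinct entries of the n x m matrix.
flat = rng.choice(n * m, size=2 * s, replace=False)
rows, cols = np.unravel_index(flat, (n, m))
# Smoothed empirical marginals of the full set of 2s entries.
pr = 0.5 * (np.bincount(rows, minlength=n) / (2 * s) + 1.0 / n)
pc = 0.5 * (np.bincount(cols, minlength=m) / (2 * s) + 1.0 / m)
Sigma = np.zeros((n, m))
Sigma[rows, cols] = 1.0 / (s * np.sqrt(pr[rows] * pc[cols]))
row_sq = (Sigma**2).sum(axis=1)        # squared Euclidean norm of each row
col_sq = (Sigma**2).sum(axis=0)        # squared Euclidean norm of each column
```

Smoothing keeps every marginal bounded away from zero, so no row or column of the matrix can have squared norm above $8m/s$ or $8n/s$ respectively, whatever set of entries is drawn.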
B.2 Transductive version of Theorem 1

Let p now denote the (unsmoothed) empirical marginals of S. If p r (i) > ^ and p c (j) > <^ for all 
$i, j$, then defining

$$\hat{X}_S = \arg\min_{X\in\mathcal{W}_r[\hat{p}]} \hat{L}_S(X),$$

we can then show that, for an $l$-Lipschitz loss $\ell$ bounded by $b$, in expectation over the split of the full set of $2s$ entries into training set $S$ and test set $T$,



L T {X S ) < inf L T (X) + O C 1/2 l 
xew r [p] ' 



rn log^ 2 (n) + b 2 



We prove this by following identical arguments as in the proof of Theorem 5: we define

H(i,j)eS} 



and obtain ||£(i)||2 , ||EW|| 9 < for all i, j, which yields 



Ec 



< O 



1 Cl 2 rn\og 1/2 {n) + b 2 



In fact, we can obtain the same result with a weaker requirement on p, namely 



max i max ||S(j) |||, max ||S^ ||| } < max < max 



m 1 



' in t~^ P r (i)P c (J) i n T< P r {i)P c {j) 



For instance, this quantity is likely to be bounded if S is a sample drawn from a product distribution on 
the matrix. 

B.3 Transductive version of Theorem [2] 

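Let $\hat{p}$ now denote the (unsmoothed) empirical marginals of $\bar{S}$, the full set of $2s$ entries. We define

$$\hat{X}_S = \arg\min_{X\in\mathcal{W}_r[\hat{p}]} \hat{L}_S(X).$$

We can then show that, for an $l$-Lipschitz loss $\ell$ bounded by $b$, without any requirements on $\hat{p}$, in expectation over the split of $\bar{S}$ into training set $S$ and test set $T$,

$$\hat{L}_T(\hat{X}_S) \le \inf_{X\in\mathcal{W}_r[\hat{p}]} \hat{L}_T(X) + O\left((l+b)\cdot\sqrt[3]{\frac{rn\log(n)}{s}}\right).$$

We prove this by combining the proof techniques used in the proofs of Theorems 2 and 5. Define

$$T_{\bar{S}}^0 = \left\{t : 1\le t\le 2s,\ \hat{p}^r(i_t) \text{ or } \hat{p}^c(j_t) < \sqrt[3]{\frac{l^2 r\log(n)}{b^2 s n^2}}\right\},\qquad T_{\bar{S}}^1 = \{1,\dots,2s\}\setminus T_{\bar{S}}^0.$$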
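We then have, writing $\bar{S}$ for the full set of $2s$ entries and $c = \sqrt[3]{\frac{l^2 r\log(n)}{b^2 s n^2}}$,

$$\begin{aligned}
\hat{\mathcal{R}}_S(\ell\circ\mathcal{W}_r[\hat{p}]) &= \mathbb{E}_{\sigma\sim\{\pm1\}^s}\left[\sup_{X\in\mathcal{W}_r[\hat{p}]} \frac{1}{s}\sum_{t=1}^s \sigma_t\,\ell(X_{i_tj_t},Y_{i_tj_t})\right] \\
&\le \mathbb{E}_{\sigma\sim\{\pm1\}^{n\times m}}\left[\sup_{X\in\mathcal{W}_r[\hat{p}]} \frac{1}{s}\sum_{ij} \ell(X_{ij},Y_{ij})\,\sigma_{ij}\cdot\mathbb{I}\{(i,j)\in\bar{S}\}\right] \\
&\le \mathbb{E}_{\sigma}\left[\sup_{X\in\mathcal{W}_r[\hat{p}]} \frac{1}{s}\sum_{ij} \ell(X_{ij},Y_{ij})\,\sigma_{ij}\cdot\mathbb{I}\big\{(i,j)\in\bar{S},\ \text{and } \hat{p}^r(i),\hat{p}^c(j) \ge c\big\}\right] \\
&\quad+ \mathbb{E}_{\sigma}\left[\sup_{X\in\mathcal{W}_r[\hat{p}]} \frac{1}{s}\sum_{ij} \ell(X_{ij},Y_{ij})\,\sigma_{ij}\cdot\mathbb{I}\big\{(i,j)\in\bar{S},\ \text{and } \hat{p}^r(i) \text{ or } \hat{p}^c(j) < c\big\}\right] \\
&= (\text{Term 1}) + (\text{Term 2}).
\end{aligned}$$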
Now define matrix £ via 



Following the same arguments as in the proof of Theorem [5j we obtain for all 

\m\i\\^\\i<-- J ^ 



s y Z 2 rlog(n) 

Therefore, using the same arguments as in the proof of Theorem [2] 

/ 



(Term 1) < ly/FO 

Next we have 
(Term 2) = E CT 



log V4 (n), 



b 2 sn 2 



l 2 brn log(n) 



^ s y Z 2 rlog(n) 

sup 1 V e(X itjt , Y itjt )<Tij ■ 1 < G S, and p r (i) or p c (j) < 
=W r m s Tj { 

< sup - V t(X lt n , Y Hn &S, and p r (i) or p c (j) < 
xew r [P] 3 ~j { 

< ■ 1 1 (i, j) G 5, and |f (») or p c (j) < 



l 2 r log(n) 
b 2 sn 2 



l 2 rlog(n) 
b 2 sn 2 



3 ll 2 rlog(n) 



b 2 sn 2 



< 



I £ £ fo H £ £ * 



< — \ b ■ 2s ■ 3 / ?2rl °g( n ) | < O I 3 / ?2r&n lp g( n ) 
~ s V & 2 sn 2 / — I V s 



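Combining the two, we get $\hat{\mathcal{R}}_S(\ell\circ\mathcal{W}_r[\hat{p}]) \le O\left(\sqrt[3]{\frac{l^2 rbn\log(n)}{s}}\right)$, and therefore, in expectation over the split of the full set of $2s$ entries into $S$ and $T$,

$$\hat{L}_T(\hat{X}_S) \le \inf_{X\in\mathcal{W}_r[\hat{p}]} \hat{L}_T(X) + O\left((l+b)\cdot\sqrt[3]{\frac{rn\log(n)}{s}}\right).$$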